2026-06-17 | | Total: 51
This paper extends classical conformal frameworks for constructing prediction intervals with global marginal coverage $1-α$ to intervals that provide explicitly calibrated guarantees for the upper and lower tails separately. Focusing on split conformal prediction, we first construct lower and upper one-sided conformal intervals that achieve marginal validity, and then derive the induced two-sided interval by intersection. Theoretical results prove both tail-specific and global marginal coverage of the induced two-sided interval. Results are presented first for the exchangeable setting, where coverage has finite-sample guarantees, and then for non-exchangeable data, where guarantees are asymptotic. Simulation studies show that the proposed approach achieves improved directional calibration relative to classical two-sided intervals, especially relevant in skewed data. Finally, the benefit of the proposed framework is showcased in a financial application, where one aims for return maximization while seeking strict control on the left tail.
Whether connected units are similar because influence spreads across ties or because similar units form ties, is a long-standing problem. Contagion or influence is generically unidentified from observational network data. We consider the minimal and common setting of a single network, fixed over time, with two waves of a binary nodal outcome. Rather than positing a parametric model for network formation, we reframe identification of contagion as a selection-bias problem and develop a sensitivity framework. We define a controlled direct effect (CDE) holding a tie present while intervening on an alter's outcome. We show that the gap between the CDE and the observed connected-dyad risk ratio is governed by how strongly a latent homophily variable shifts the composition of connected dyads. Inspired by Smith-style selection-bias sensitivity analysis and the risk-ratio bounding function of Ding and VanderWeele we develop interpretable nonparametric bounds. This translates the question "is there contagion?" into the question "how strong would latent homophily have to be to explain away the observed contagion?" A simulation study characterizes the bounds' error control and power. We apply the framework to the 2008 U.S. House votes on the Troubled Asset Relief Program, identifying under which assumptions contagion is plausible.
Temporal difference (TD) learning with linear function approximation is a core method for policy evaluation. Its classical continuous-time description is an ordinary differential equation (ODE), which captures the asymptotic mean dynamics but neglects stochastic fluctuations determining the error floor. We introduce a stochastic differential equation (SDE) approximation for linear TD(0) under Markovian noise. The resulting model distinguishes the contraction dynamics governed by the projected Bellman operator from the influence of Markovian sampling. As a consequence, the model explains the constant-stepsize error floor through the interaction between Markovian long-run covariance and the contraction geometry of the projected Bellman operator.
We introduce an amortized Bayesian framework for spatial boundary detection that generalizes posterior inference across areal graphs with varying numbers of regions and diverse adjacency structures. The underlying model couples a Poisson count likelihood with a covariate-driven rule to interrupt smoothing across dissimilar neighboring areas, utilizing a directed acyclic graph autoregressive (DAGAR) prior to capture residual spatial dependence. To approximate the target posterior distribution, a neural engine is trained on simulated maps: a permutation-invariant summary network encodes graph-aware representations of the observed counts, offsets, covariates, and adjacency matrices, while a conditional normalizing flow generates the approximate posterior draws. Simulation studies demonstrate accurate parameter recovery, near-nominal interval coverage, well-calibrated posterior predictive behavior, and informative posterior boundary probabilities. Benchmarking against Markov chain Monte Carlo (MCMC) confirms close agreement regarding primary boundary evidence, and an ablation study validates the inclusion of model-guided graph summaries. Finally, applications to Glasgow respiratory disease and California lung cancer data demonstrate that a single trained neural engine can be seamlessly deployed across real-world maps with distinct graph structures, yielding boundary conclusions consistent with established localized smoothing analyses.
Alzheimer's disease is characterized by the progressive accumulation of amyloid-$β$ and tau followed years later by cognitive impairment. Despite this established motif, substantial subject-level variability exists in the age of pathological progression and the onset of cognitive symptoms. To understand the source of this variation, subjects must be aligned across heterogeneous disease timelines via frameworks that jointly model disease progression and time to cognitive impairment with reference to landmark positivity thresholds. Existing neurodegenerative disease progression models rely on restrictive parametric forms, fail to anchor disease timelines to positivity thresholds, and decouple biomarker trajectories from cognitive survival endpoints. To address these limitations, we introduce the Bayesian Threshold-Aligned Joint Disease Progression Model (B-TAJ DPM). This generative, semi-parametric framework models multivariate disease progression trajectories over latent disease timelines anchored at landmark positivity thresholds. Crucially, the framework integrates a survival model to link pathological progression to cognitive impairment. Posterior inference and posterior predictions for unseen subjects are carried out in open-source software. Simulation studies demonstrate excellent estimation accuracy and interval coverage. When applied to Alzheimer's Disease Neuroimaging Initiative data, B-TAJ DPM characterizes non-linear progression patterns, quantifies subject-level variation in positivity age, and reveals links between age of tau positivity and acceleration of cognitive impairment.
Prior to the version 1.3.1 update on CRAN in December 2025, gsynth, a popular R package for estimating Interactive Fixed Effects (IFE) models, could drastically and systematically underestimate standard errors. This underestimation would occur when two estimation options (inference = "parametric", and EM = TRUE) were used together, in which case the package would apply a parametric bootstrap procedure to Gobillon and Magnac (2016)'s IFE-EM estimator. The package ceased supporting this combination in December 2025, and the latest documentation now describes the parametric bootstrap as not suitable for use with the IFE-EM estimator due to a theoretical incompatibility. Our focus is an implementation error we identified in the pre-1.3.1 versions of gsynth: the parametric bootstrap used when EM = TRUE did not match the algorithm proposed in Xu (2017), using in-sample residuals instead of out-of-sample errors. We show that this implementation error alone can cause underestimation by orders of magnitude. We conduct an empirical Monte Carlo study using randomly assigned placebo treatments on a series of state-level panel datasets, and show that gsynth could yield high false positive rates in realistic settings. We identify three papers published in the American Political Science Review that are affected by this behavior. Reanalyzing the relevant sections of these papers, we show that (i) correcting the implementation error renders most findings insignificant, and (ii) using Xu (2017)'s Generalized Synthetic Control method in place of IFE-EM renders every finding insignificant.
Geostatistical spatial prediction for environmental processes is typically undertaken using Gaussian process models via Kriging, while machine learning (ML) algorithms are state-of-the-art for non-spatial prediction. An exciting recent fusion of these ideas imbibes traditional ML algorithms with the capacity to deal with spatial autocorrelation, leading to improved predictive performance. A range of approaches have been proposed, including fusion with Gaussian processes, observation-driven correlation structures, spatial basis functions and local geographical fitting. However, there has been no numerical comparison of their relative predictive performances, which is needed to advise environmental scientists on the optimal approach to use. This paper fills this knowledge gap, and focuses on random forests as the ML algorithm because they are more computationally and conceptually straightforward to implement than deep learning algorithms. The results from two studies are presented, the first being a controlled simulation experiment investigating whether any single approach is consistently superior across different spatial autocorrelation types. The second study focuses on the prediction of air pollution concentrations within a tuberculosis prevalence study in Blantyre, Malawi. The results show that whilst no single approach is universally superior, utilising spatial basis functions appears to perform consistently well across both the simulation and real data studies.
Causal discovery seeks to uncover the causal dependencies among variables. For this purpose, we propose an algorithm called Tensor-based Second-order Causal Discovery (TSCD). Its input is a tensor obtained from the covariance matrices of observational and interventional data. Assuming the causal dependencies follow a linear structural equation model on a directed acyclic graph (DAG), TSCD outputs the DAG and the functions on its edges, requiring only that the noise variables are uncorrelated. We also implement a version of the approach for nonlinear models. Our focus on second-order statistics (via the covariance matrices) is motivated by their statistical and computational efficiency relative to higher-order moments, their identifiability relative to first-order statistics, and that they work regardless of whether the variables are Gaussian. We show that TSCD has identifiable causal order and parameters from a number of interventions that is logarithmic in the number of variables. Experiments show that TSCD is robust to noise, competitive with existing methods, and scales to hundreds of variables.
Understanding urban mobility patterns is crucial for designing efficient and sustainable transportation systems. Motivated by an application to the municipality of Padova and its surroundings, we propose a novel statistical framework for the analysis and clustering of mobility trajectories derived from telephonic data. We introduce a compositional representation of individual movements that integrates the uncertain device location with information on the surrounding road network, encoding at each time point the proportions of different road types compatible with the observed position. This formulation naturally accounts for measurement uncertainty and yields trajectories evolving in the simplex. To model these data, we develop a state-space framework for compositional time series that captures both the telephonic measurement error and the temporal dynamics of the latent mobility process. Building on this representation, we propose a model-based clustering approach based on mixtures of state-space models to identify groups of trajectories with similar evolution. This allows us to aggregate individual movements into interpretable mobility patterns at the population level. The results of the case study demonstrate the ability of the approach to uncover meaningful mobility behaviors, providing insights that are potentially relevant to policy makers.
Constraint-based causal discovery relies on repeated conditional independence tests, but fast nonparametric tests often sacrifice calibration, especially when variables depend on the conditioning set through nonlinear relationships. We introduce BLITZ (Broad-to-Local Independence Testing via residualiZation), a nonparametric conditional independence test designed to run well under a second while maintaining the accuracy needed for the thousands of queries performed by constraint-based causal discovery algorithms. BLITZ first removes broad smooth dependence on the conditioning set using low-order polynomial regression, then applies a small nonlinear feature map and residualizes those features with shallow tree regressions. The resulting statistic tests residual cross-covariance, with a moment-matched chi-square approximation to the null distribution. We show theoretically that the two-stage design reduces the effective complexity faced by the tree residualizers, allowing shallow trees to control residual conditional-mean bias while avoiding excessive overfitting. In simulations, BLITZ provides better null calibration than fast kernel, random-feature, and regression-based competitors while remaining among the fastest methods tested. In causal discovery experiments on synthetic graphs and flow-cytometry data, BLITZ yields more reliable endpoint orientations among retained adjacencies and competitive structural recovery. These results suggest that broad-to-local residualization is a practical route to calibrated, scalable nonparametric conditional independence testing for causal discovery.
We study the privacy of releasing posterior sample paths from a Gaussian process (GP) when the entire training set including covariates and responses is private. Unlike standard differential-privacy (DP) mechanisms that add external noise, posterior sampling is random by construction. We show that this intrinsic randomness yields DP guarantees by deriving explicit Rényi-DP bounds for GP posterior sample-path release. The bounds separate posterior-mean leakage from data-dependent posterior-covariance leakage showing that meaningful privacy depends sharply on effective ridge regularisation. We apply membership-inference attacks to show that empirical leakage follows the predicted dependence on regularisation, posterior variance and the number of released posterior sample-paths. Utility experiments on downstream posterior-sampling tasks identify noisy-observation regimes where privacy-compatible regularisation preserves useful decisions with modest utility loss. When stronger privacy is needed, the intrinsic guarantee can be sharpened by adding calibrated GP noise, providing an explicit additional privacy knob.
Understanding how individual metro usage evolves over multi-year horizons is essential for transit planning and passenger retention. However, existing approaches typically characterize mobility patterns as static clusters or short-term variability, leaving the lifecycle dynamics of transit participation underexplored. This study proposes a state-based lifecycle modeling framework that integrates Hidden Semi-Markov Models (HSMM) with discrete-time survival analysis to characterize the evolution of individual metro mobility. The HSMM infers latent mobility states with explicit duration distributions and a transition matrix governing regime changes, while the survival component models exit and re-entry events via state-dependent hazard functions conditioned on mobility-state trajectories and behavioral history. Applied to four years of smart card data from the Shanghai metro system (2021-2024), the framework enables the identification of interpretable mobility states, the characterization of transition dynamics, and the quantification of state-dependent exit and re-entry processes. The analysis reveals five robust mobility states with a directional transition hierarchy centered on an occasional-usage gateway state, and fundamentally different temporal mechanisms governing disengagement and return: exit hazard is state-dependent but duration-independent, whereas re-entry hazard decays sharply with inactivity length. These findings provide a methodological foundation for lifecycle-oriented mobility analysis and practical guidance for transit operators to identify at-risk users and time retention interventions.
The influence of environmental exposures, such as air pollution, on human health has become increasingly recognized. A growing body of evidence suggests that the microbiome may mediate these effects, explaining the relationship between the environment and host biology. However, the impact of environmental exposures on the microbiome is not yet fully understood, and statistical modeling in this context is challenged by complex dependency structures. In particular, microbiome data exhibit spatial dependencies across sampling regions as well as ecological correlations among microbial taxa, which, if ignored, can substantially reduce detection power, leading to missed true signals. We introduce a novel spatial mixed modeling framework for microbiome data that accounts for both region-level spatial dependency and taxon-level ecological dependency using conditional autoregressive priors. Through simulations, we demonstrate that this framework outperforms existing methods that ignore such dependencies, by achieving high detection power in feature selection while maintaining low false positive rates and reduced mean squared error in estimation. Applied to two real studies-data from Food and Microbiome Longitudinal Investigation study and lung microbiome dataset-with fine particulate matter (PM_2.5) exposures, our model identified genera, which are known to be involved in pollution-related health outcomes, as well as novel taxa that may mediate host responses to air pollution. This novel approach offers a powerful and flexible tool for uncovering biologically meaningful associations in complex environmental data.
The nested sampling (NS) technique has gained widespread attention, particularly in cosmology and astronomy, due to its ability to efficiently explore high-likelihood regions - a feature akin to an implicit likelihood optimization that underlies its success. While the full theoretical derivation of NS is complex and involves several approximations, the central challenge lies in sampling from the likelihood-constrained priors, which is crucial for its performance. This work provides a comprehensive and detailed exposition of NS derivation, clarifying both its theoretical foundations and practical challenges. We provide a thorough description of the NS procedure, emphasizing both its strengths and potential limitations. In doing so, this work seeks to deepen understanding of the method and to foster the development of future enhancements, novel variants, and more efficient implementations across a wide range of scientific applications. Thus, the main contribution of this work is twofold: it serves both as a tutorial for newcomers to the field and as a critical review for experienced practitioners.
Subgroup analysis is routinely used in randomized controlled trials to examine whether treatment effects are homogeneous across patient subgroups or differ because of treatment-effect heterogeneity. In this paper, we investigate the properties of the odds ratio and the relative response in subgroup analyses with binary outcomes, extending previous work with new theoretical insights and methodological developments. We establish several new theorems that characterize how the odds ratio for the overall population changes in both magnitude and direction when two subgroups are combined. These results further confirm that the odds ratio is inappropriate as an efficacy measure in this subgroup setting, whereas the relative response is appropriate. We also present the formal relationship between the odds ratio and the relative response, and clarify their differences in terms of the logic-respecting property, that is, whether the overall efficacy lies between the subgroup efficacies, and the dilution property, that is, whether mixing subgroups moves the overall odds ratio toward 1. Although the odds ratio is generally not logic-respecting, it may behave approximately like a logic-respecting efficacy measure under certain conditions. To illustrate our findings, we present an illustrative example based on clinical trial data and discuss its implications for subgroup analysis in randomized controlled trials.
We develop a decision-theoretic framework for distributed Bayesian experimental design in which local agents evaluate candidate experiments using expected information gain and transmit their local design decisions to a fusion center. Unlike centralized Bayesian design, where all likelihood components and information-gain values are available to a single planner, the fusion center in the distributed setting chooses a global experiment from compressed local recommendations. We derive the Bayes-optimal fusion rule, which selects the experiment with largest conditional expected centralized information gain given the observed local design decisions. This rule is analogous in spirit to optimal fusion rules in distributed detection, but differs fundamentally because the underlying utility is expected information gain and the resulting loss is information-gain regret rather than classification error. We also establish information-loss bounds and identify conditions under which the decision-only fusion rule is asymptotically equivalent to the centralized design. Numerical experiments show that Bayes-optimal fusion closely approximates the centralized oracle, whereas majority voting can be highly suboptimal when a minority of sites carry disproportionate information.
Response-adaptive randomization (RAR) in clinical trials aims to improve ethical and statistical efficiency by dynamically allocating patients to treatments based on observed outcomes. While RAR based on a target optimal allocation have been extensively studied for two-arms settings, their extension to multi-treatment experiments ($K \geq 2$) remains theoretically fragmented, with most existing methods focusing on specific algorithms or restricted target allocations. In this paper, we introduce a unified framework for response-adaptive targeting, the $α$-Rebalancing Targeting Strategies ($α$RTS), which generalize the ERADE two-armed strategy of Hu et al. [2009]. We prove that all designs in this family share fundamental asymptotic properties: strong consistency, asymptotic normality of allocation proportions and treatment effect estimators, and asymptotic efficiency. To address sparse target regimes (where some treatments are asymptotically eliminated), we further propose $α$RTS with Forced Exploration, a variant that guarantees infinite sampling for all treatments while preserving the asymptotic guarantees. Extensive simulations illustrate the finite-sample behavior of $α$RTS variants in a 3-armed context, highlighting in particular the critical role of forced exploration in sparse settings.
Understanding how extreme price movements propagate across financial and energy markets is critical for risk management and regulatory design in the EU Emissions Trading System (EU ETS). We apply Hüsler-Reiss graphical models of extremes to a system of 20 daily variables centred on EU allowances futures across Phases 3 and 4 of the EU ETS (2013--2025), with a Gaussian graphical model as the average-dependence baseline. The tail networks are structurally distinct from the average dependence network: substantially denser, organized around different central nodes, and governed by within-sector homophily that binds sector boundaries more tightly than at the average-dependence level. EU allowances futures are peripheral in the standard graphical model but achieve the highest centrality in the tail networks, while equity indices and major FX pairs follow the opposite trajectory. Exponential random graph models confirm equity and FX peripherality in tail networks across all sample periods and identify triadic closure during market downturns as a Phase~3 phenomenon that vanishes in Phase~4. The phase transition restructures the tail network without thinning it: average dependence contracts sharply while tail dependence persists, and crash contagion shifts from clustered to diffuse propagation. These findings have direct implications for hedge construction by compliance entities, stress-test calibration by regulators, and the design of systemic-risk monitoring tools for EU ETS markets.
Explaining precipitation behavior at daily scale is important for fine scale understanding of the mechanisms driving precipitation. However, this effort is challenging because of the frequent incidence of zeros. The challenge is amplified by the acknowledged incidence of two types of zeros -- absence of precipitation as a dry event and absence of measured precipitation due to detection limits. In this work, we propose a multilevel spatio-temporal model which allows us to distinguish and explain the two types of zeros, as well as to model positive precipitation above the detection limit. The methodology combines a point mass at zero with probability modeled through a probit regression, a Gamma regression for latent positive precipitation amounts, and an observation mechanism subject to threshold-induced censoring. To capture spatial dependencies, Gaussian processes are employed in each regression model. Working within a Bayesian framework, we can obtain a rich range of inference with exact uncertainty. In particular, we provide model-based inference tools to compare and quantify differences between the true precipitation process and its observed counterpart across relevant characteristics. We apply our model to the analysis of daily spring observations at 70 sites over 15 years from the Ebro River Basin in northeastern Spain. Our findings indicate that the threshold strongly affects the occurrence of observed precipitation, especially in humid regions. While its impact on total accumulated amounts is small, it can exert a relevant effect on upper quantiles.
Graph-based learning methods have become increasingly prominent due to their strong performance across diverse applications. Among these, recent frameworks grounded in diffusion processes provide a unifying perspective that extends traditional graph neural network formulations while addressing limitations of standard message-passing mechanisms. Despite these advances, concerns remain regarding the fairness of such models, as they may propagate or amplify biases present in the data. In this work, we introduce a fairness-aware adaptation of graph-based diffusion by modifying the underlying Laplacian operator. Our approach incorporates multiple complementary transformations, including subspace projections, spectral adjustments, and frequency-based filtering, to mitigate bias-related components. Leveraging the intrinsic smoothing properties of graph diffusion, we provide a principled analysis of the resulting behavior and establish theoretical insights into fairness properties. We evaluate the proposed framework on both synthetic and real-world datasets, demonstrating that it achieves competitive performance while improving fairness metrics with limited additional computational cost.
This paper builds a hierarchy of explicit, non-asymptotic tail bounds for the supremum of the Kostlan--Shub--Smale (KSS) random field on the sphere, and applies it to two problems: Spiked Tensor PCA and the landscape of the spherical $k$-spin model. For Tensor PCA, we study the non-asymptotic statistical limits of estimating a rank-$R$ symmetric signal tensor of order~$k\ge 3$ and dimension~$d\ge 3$ from a single Gaussian observation at signal-to-noise ratio~$λ$, through the \emph{profile maximum likelihood estimator}, the MLE restricted to normalized rank-$R$ tensors of coherence at least~$κ$. Our analysis uses a single reduction: a deterministic geometric inequality (the Tube Method) and a rank-reduction step bound the estimation error by the supremum of the canonical KSS field, which the Kac--Rice formula turns into a Gaussian integral against the expected absolute characteristic polynomial of a shifted Gaussian Orthogonal Ensemble, controlled in turn by the four explicit tail bounds of our hierarchy (three from a Mehta--Fyodorov representation, one from a Ben Arous--Dembo--Guionnet large deviation). The same reduction yields two results, each with explicit constants. For estimation, a finite-$(k,d)$ error bound recovers the asymptotically optimal rate~$\sqrt{d\log k}$ of Perry, Wein and Bandeira, with explicit dependence on the rank~$R$ and the coherence~$κ$. For the landscape, a two-sided non-asymptotic bracketing of the annealed complexity of the spherical $k$-spin Hamiltonian recovers the Auffinger--Ben Arous--Černý complexity function in the high-dimensional limit.
Mediation analysis is essential for decomposing the causal effect of a treatment into direct and indirect pathways. However, many practical settings rely on the stringent assumption that recanting witnesses, defined as treatment-induced mediator-outcome confounders, are either absent or fully known a priori. Such a requirement is often untenable, especially when these variables remain unobservable due to measurement difficulties or privacy constraints. In this paper, we leverage proximal causal inference to develop three novel identification strategies to address the challenge of identifying path-specific effects in the presence of unknown recanting witnesses. Building on this, we develop a semiparametric inference framework that derives the efficient influence function and proposes a proximal multiply robust estimator, which remains consistent if at least one set of nuisance models is correctly specified. When all nuisance models are correctly specified and converge at appropriate rates, the estimator is asymptotically normal and achieves the semiparametric efficiency bound. We provide a minimax optimization-based debiased machine learning procedure for point estimation and constructing valid confidence intervals. The performance of the proposed methods is demonstrated by simulation studies and a real data application.
We develop an anytime-valid framework for optimal policy identification from logged contextual bandit data. In many applied settings, the analyst wants to select the optimal policy from a candidate policy class $Π$, but data are generated by an externally determined logging policy that they do not control. The analyst may also wish to monitor evidence continuously, stopping once the optimal policy is clear rather than committing to a fixed sample size in advance. This paper addresses these challenges by constructing a time-indexed set $S_t$ that retains the true optimal policy set uniformly over time with high probability. The resulting procedure allows the analyst to monitor policy values, eliminate clearly suboptimal policies, and stop at data-dependent times without invalidating inference. When the optimal policy is unique, we define a stopping time for its identification and derive a sample-complexity bound scaling as $O\!\left(\frac{\log |Π|+\log\log(1/Δ_{\min})}{Δ_{\min}^2}\right)$, where $Δ_{\min}$ is the gap between the best and second-best policy values. Simulations demonstrate that the anytime-valid approach can yield substantial sample savings relative to fixed-$N$ designs. An application to a large adaptive experiment on reducing misinformation online illustrates how the method provides a dynamic view as evidence on the optimal policy accumulates.
Binary data factorization is common, but real-valued methods ignore discreteness and yield hard-to-interpret factors. Boolean Matrix Factorization (BooMF) instead decomposes a binary matrix into two lower-rank binary matrices via logical AND and OR, expressing the data as a Boolean disjunction of interpretable patterns. In cancer genomics, BooMF can reveal coordinated feature changes that may drive tumor evolution, unlike rotational or additive decompositions. Most existing BooMF methods are heuristic, greedy, sensitive to initialization, prone to local optima, and do not support principled model selection or uncertainty quantification. We introduce Bayesian Boolean Matrix Factorization (BBMF), a fully conjugate generative model with sparsity-inducing priors. It enforces Boolean constraints, yields interpretable latent factors with coherent uncertainty quantification, and admits Gibbs sampling with closed-form full conditionals. Because cancer evolution often involves widespread, near-simultaneous chromosome-number changes (e.g., whole-genome duplication followed by instability and selection), Boolean factorizations capture these patterns more naturally than additive models. Applied to arm-level copy-number alteration data in multiple myeloma, where entries indicate presence/absence of chromosomal-arm amplifications, BBMF finds a small set of interpretable bicliques linking patient subsets to recurrently co-altered chromosomal arms, providing a compact, biologically meaningful summary of tumor heterogeneity and demonstrating BBMF's utility for uncovering discrete latent structure in complex binary data.
Small sample sizes pose significant challenges in regression analysis, often leading to violations of classical assumptions such as normality, homoscedasticity, and independence of residuals. These violations compromise parameter estimation accuracy, reduce statistical power, and limit the generalizability of findings. This study introduces the Gaussian Process-based Modified Extreme Value Theorem (GP-MEVT) method, a novel hybrid data augmentation approach that combines Gaussian Process with Extreme Value Theory to address these limitations. The GP-MEVT method generates augmented observations that extend the predictor space beyond the observed range while preserving the underlying linear structure and introducing controlled variability based on residual variation, through comprehensive simulation studies across three variance scenarios (sigma = 2, 5, 8) and sample sizes (n = 10, 15, 20). Here, we demonstrate that GP-MEVT achieves a higher rate of assumption satisfaction, substantially outperforming standard bootstrap and bootstrap with noise methods. The proposed method also exhibits reasonable parameter estimation accuracy, with intercept and slope estimates consistently closer to true parameter values, and maintains competitive or superior model fitting performance as measured by root mean square error. Application to a real-world dataset confirms these advantages, with GP-MEVT achieving a 67.1% assumption satisfaction rate compared to 17.3% and 21.2% for bootstrap alternatives. These findings establish GP-MEVT as a robust and reliable framework for fitting linear regression models to small datasets, offering practitioners a principled approach to statistical inference when sample size limitations are unavoidable.