2025-12-30 | | Total: 72
We introduce Iterated Bellman Calibration, a simple, model-agnostic, post-hoc procedure for calibrating off-policy value predictions in infinite-horizon Markov decision processes. Bellman calibration requires that states with similar predicted long-term returns exhibit one-step returns consistent with the Bellman equation under the target policy. We adapt classical histogram and isotonic calibration to the dynamic, counterfactual setting by repeatedly regressing fitted Bellman targets onto a model's predictions, using a doubly robust pseudo-outcome to handle off-policy data. This yields a one-dimensional fitted value iteration scheme that can be applied to any value estimator. Our analysis provides finite-sample guarantees for both calibration and prediction under weak assumptions, and critically, without requiring Bellman completeness or realizability.
We present an online method for guaranteeing calibration of quantile forecasts at multiple quantile levels simultaneously. A sequence of $α$-level quantile forecasts is calibrated if the forecasts are larger than the target value at an $α$-fraction of time steps. We introduce a lightweight method called Multi-Level Quantile Tracker (MultiQT) that wraps around any existing point or quantile forecaster to produce corrected forecasts guaranteed to achieve calibration, even against adversarial distribution shifts, while ensuring that the forecasts are ordered -- e.g., the 0.5-level quantile forecast is never larger than the 0.6-level forecast. Furthermore, the method comes with a no-regret guarantee that implies it will not worsen the performance of an existing forecaster, asymptotically, with respect to the quantile loss. In experiments, we find that MultiQT significantly improves the calibration of real forecasters in epidemic and energy forecasting problems.
Joint modeling of longitudinal and survival data has become increasingly important in medical research, particularly for understanding disease progression in chronic conditions where both repeated biomarker measurements and time-to-event outcomes are available. Traditional two-stage methods, which analyze longitudinal and survival components separately, often result in biased estimates and suboptimal predictions due to failure to account for their interdependence. In this study, we propose a Bayesian hierarchical joint modeling framework with an emphasis on predictive evaluation and clinical interpretability. The model simultaneously characterizes the longitudinal trajectory of a biomarker and the associated survival outcome through shared random effects, capturing the intrinsic association between disease dynamics and event risk. The Bayesian formulation allows flexible incorporation of prior information, accommodates irregular measurement times and missing data, and provides full posterior distributions for uncertainty quantification via credible intervals. We evaluate the proposed framework using both simulated data designed to mimic realistic patient trajectories and a real-world clinical dataset involving patients with chronic liver disease. Results demonstrate that the Bayesian joint model consistently outperforms conventional two-stage approaches in terms of parameter estimation accuracy and predictive performance, as measured by time-dependent area under the curve and Brier scores. The proposed approach provides a robust and interpretable tool for dynamic, patient-specific prognosis, supporting clinical decision-making in personalized medicine.
We investigate machine learning models for stock return prediction in non-stationary environments, revealing a fundamental nonstationarity-complexity tradeoff: complex models reduce misspecification error but require longer training windows that introduce stronger non- stationarity. We resolve this tension with a novel model selection method that jointly optimizes model class and training window size using a tournament procedure that adaptively evaluates candidates on non-stationary validation data. Our theoretical analysis demonstrates that this approach balances misspecification error, estimation variance, and non-stationarity, performing close to the best model in hindsight. Applying our method to 17 industry portfolio returns, we consistently outperform standard rolling-window benchmarks, improving out-of-sample $R^2$ by 14-23% on average. During NBER- designated recessions, improvements are substantial: our method achieves positive $R^2$ during the Gulf War recession while benchmarks are negative, and improves $R^2$ in absolute terms by at least 80bps during the 2001 recession as well as superior performance during the 2008 Financial Crisis. Economically, a trading strategy based on our selected model generates 31% higher cumulative returns averaged across the industries.
We propose a novel Bayesian optimization (BO) procedure aimed at identifying the ``profile optima'' of a deterministic black-box computer simulation that has a single control parameter and multiple nuisance parameters. The profile optima capture the optimal response values as a function of the control parameter. Our objective is to identify them across the entire plausible range of the control parameter. Classic BO, which targets a single optimum over all parameters, does not explore the entire control parameter range. Instead, we develop a novel two-stage acquisition scheme to balance exploration across the control parameter and exploitation of the profile optima, leveraging deep and shallow Gaussian process surrogates to facilitate uncertainty quantification. We are motivated by a computer simulation of a diffuser in a rotating detonation combustion engine, which returns the energy lost through diffusion as a function of various design parameters. We aim to identify the lowest possible energy loss as a function of the diffuser's length; understanding this relationship will enable well-informed design choices. Our ``profile Bayesian optimization'' procedure outperforms traditional BO and profile optimization methods on a variety of benchmarks and proves effective in our motivating application.
Bayesian profile regression mixture models (BPRM) allow to assess a health risk in a multi-exposed population. These mixture models cluster individuals according to their exposure profile and their health risk. However, their results, based on Monte-Carlo Markov Chain (MCMC) algorithms, turned out to be unstable in different application cases. We suppose two reasons for this instability. The MCMC algorithm can be trapped in local modes of the posterior distribution and the choice of post-treatment procedures used on the output of the MCMC algorithm leads to different clustering structures. In this work, we propose improvements of the MCMC algorithms proposed in previous works in order to avoid the local modes of the posterior distribution while reducing the computation time. We also carry out a simulation study to compare the performances of the MCMC algorithms and different post-processing in order to provide guidelines on their use. An application in radiation epidemiology is considered.
We tackle the challenge of estimating grouping structures and factor loadings in asset pricing models, where traditional regressions struggle due to sparse data and high noise. Existing approaches, such as those using fused penalties and multi-task learning, often enforce coefficient homogeneity across cross-sectional units, reducing flexibility. Clustering methods (e.g., spectral clustering, Lloyd's algorithm) achieve consistent recovery under specific conditions but typically rely on a single data source. To address these limitations, we introduce the Panel Coupled Matrix-Tensor Clustering (PMTC) model, which simultaneously leverages a characteristics tensor and a return matrix to identify latent asset groups. By integrating these data sources, we develop computationally efficient tensor clustering algorithms that enhance both clustering accuracy and factor loading estimation. Simulations demonstrate that our methods outperform single-source alternatives in clustering accuracy and coefficient estimation, particularly under moderate signal-to-noise conditions. Empirical application to U.S. equities demonstrates the practical value of PMTC, yielding higher out-of-sample total $R^2$ and economically interpretable variation in factor exposures.
Accurate spatial interpolation of the air quality index (AQI), computed from concentrations of multiple air pollutants, is essential for regulatory decision-making, yet AQI fields are inherently non-Gaussian and often exhibit complex nonlinear spatial structure. Classical spatial prediction methods such as kriging are linear and rely on Gaussian assumptions, which limits their ability to capture these features and to provide reliable predictive distributions. In this study, we propose \textit{deep classifier kriging} (DCK), a flexible, distribution-free deep learning framework for estimating full predictive distribution functions for univariate and bivariate spatial processes, together with a \textit{data fusion} mechanism that enables modeling of non-collocated bivariate processes and integration of heterogeneous air pollution data sources. Through extensive simulation experiments, we show that DCK consistently outperforms conventional approaches in predictive accuracy and uncertainty quantification. We further apply DCK to probabilistic spatial prediction of AQI by fusing sparse but high-quality station observations with spatially continuous yet biased auxiliary model outputs, yielding spatially resolved predictive distributions that support downstream tasks such as exceedance and extreme-event probability estimation for regulatory risk assessment and policy formulation.
Gaussian process-based models are attractive for estimating heterogeneous treatment effects (HTE), but their computational cost limits scalability in causal inference settings. In this work, we address this challenge by extending Patchwork Kriging into the causal inference framework. Our proposed method partitions the data according to the estimated propensity score and applies Patchwork Kriging to enforce continuity of HTE estimates across adjacent regions. By imposing continuity constraints only along the propensity score dimension, rather than the full covariate space, the proposed approach substantially reduces computational cost while avoiding discontinuities inherent in simple local approximations. The resulting method can be interpreted as a smoothing extension of stratification and provides an efficient approach to HTE estimation. The proposed method is demonstrated through simulation studies and a real data application.
Uncertainty quantification is essential for scientific analysis, as it allows for the evaluation and interpretation of variability and reliability in complex systems and datasets. In their original form, multivariate statistical regression models (partial least-squares regression, PLS, principal component regression, PCR) along with their kernelized versions (kernel partial least-squares regression, K-PLS, kernel principal component regression, K-PCR), do not incorporate uncertainty quantification as part of their output. In this study, we propose a method inspired by conformal inference to estimate and calibrate the uncertainty of multivariate statistical models. The result of this method is a point prediction accompanied by prediction intervals that depend on the input data. We tested the proposed method on both traditional and kernelized versions of PLS and PCR. The method is demonstrated using synthetic data, as well as laboratory near-infrared (NIR) and airborne hyperspectral regression models for estimating functional plant traits. The model was able to successfully identify the uncertain regions in the simulated data and match the magnitude of the uncertainty. In real-case scenarios, the optimised model was not overconfident nor underconfident when estimating from test data: for example, for a 95% prediction interval, 95% of the true observations were inside the prediction interval.
This paper develops a general approach for deep learning for a setting that includes nonparametric regression and classification. We perform a framework from data that fulfills a generalized Bernstein-type inequality, including independent, $φ$-mixing, strongly mixing and $\mathcal{C}$-mixing observations. Two estimators are proposed: a non-penalized deep neural network estimator (NPDNN) and a sparse-penalized deep neural network estimator (SPDNN). For each of these estimators, bounds of the expected excess risk on the class of Hölder smooth functions and composition Hölder functions are established. Applications to independent data, as well as to $φ$-mixing, strongly mixing, $\mathcal{C}$-mixing processes are considered. For each of these examples, the upper bounds of the expected excess risk of the proposed NPDNN and SPDNN predictors are derived. It is shown that both the NPDNN and SPDNN estimators are minimax optimal (up to a logarithmic factor) in many classical settings.
Causal inference is a key research area in machine learning, yet confusion reigns over the tools needed to tackle it. There are prevalent claims in the machine learning literature that you need a bespoke causal framework or notation to answer causal questions. In this paper, we want to make it clear that you \emph{can} answer any causal inference question within the realm of probabilistic modelling and inference, without causal-specific tools or notation. Through concrete examples, we demonstrate how causal questions can be tackled by writing down the probability of everything. Lastly, we reinterpret causal tools as emerging from standard probabilistic modelling and inference, elucidating their necessity and utility.
Intrinsic Gaussian fields are used in many areas of statistics as models for spatial or spatio-temporal dependence, or as priors for latent variables. However, there are two major gaps in the literature: first, the number and flexibility of existing intrinsic models are very limited; second, theory, fast inference, and software are currently underdeveloped for intrinsic fields. We tackle these challenges by introducing the new flexible class of intrinsic Whittle--Matérn Gaussian random fields obtained as the solution to a stochastic partial differential equation (SPDE). Exploiting sparsity resulting from finite-element approximations, we develop fast estimation and simulation methods for these models. We demonstrate the benefits of this intrinsic SPDE approach for the important task of kriging under extrapolation settings. Leveraging the connection of intrinsic fields to spatial extreme value processes, we translate our theory to an SPDE approach for Brown--Resnick processes for sparse modeling of spatial extreme events. This new paradigm paves the way for efficient inference in unprecedented dimensions. To demonstrate the wide applicability of our new methodology, we apply it in two very different areas: a longitudinal study of renal function data, and the modeling of marine heat waves using high-resolution sea surface temperature data.
Link prediction, a foundational task in complex network analysis, has extensive applications in critical scenarios such as social recommendation, drug target discovery, and knowledge graph completion. However, existing evaluations of algorithmic often rely on experiments conducted on a limited number of networks, assuming consistent performance rankings across domains. Despite the significant disparities in generative mechanisms and semantic contexts, previous studies often improperly highlight ``universally optimal" algorithms based solely on naive average over networks across domains. This paper systematically evaluates 12 mainstream link prediction algorithms across 740 real-world networks spanning seven domains. We present substantial empirical evidence elucidating the performance of algorithms in specific domains. This findings reveal a notably low degree of consistency in inter-domain algorithm rankings, a phenomenon that stands in stark contrast to the high degree of consistency observed within individual domains. Principal Component Analysis shows that response vectors formed by the rankings of the 12 algorithms cluster distinctly by domain in low-dimensional space, thus confirming domain attributes as a pivotal factor affecting algorithm performance. We propose a metric called Winner Score that could identify the superior algorithm in each domain: Non-Negative Matrix Factorization for social networks, Neighborhood Overlap-aware Graph Neural Networks for economics, Graph Convolutional Networks for chemistry, and L3-based Resource Allocation for biology. However, these domain-specific top-performing algorithms tend to exhibit suboptimal performance in other domains. This finding underscores the importance of aligning an algorithm's mechanism with the network structure.
Conformal prediction (CP) is widely presented as distribution-free predictive inference with finite-sample marginal coverage under exchangeability. We argue that CP is best understood as a rank-calibrated descendant of the Fisher-Dempster-Hill fiducial/direct-probability tradition rather than as Bayesian conditioning in disguise. We establish four separations from coherent countably additive predictive semantics. First, canonical conformal constructions violate conditional extensionality: prediction sets can depend on the marginal design P(X) even when P(Y|X) is fixed. Second, any finitely additive sequential extension preserving rank calibration is nonconglomerable, implying countable Dutch-book vulnerabilities. Third, rank-calibrated updates cannot be realized as regular conditionals of any countably additive exchangeable law on Y^infty. Fourth, formalizing both paradigms as families of one-step predictive kernels, conformal and Bayesian kernels coincide only on a Baire-meagre subset of the space of predictive laws. We further show that rank- and proxy-based reductions are generically Blackwell-deficient relative to full-data experiments, yielding positive Le Cam deficiency for suitable losses. Extending the analysis to prediction-powered inference (PPI) yields an analogous message: bias-corrected, proxy-rectified estimators can be valid as confidence devices while failing to define transportable belief states across stages, shifts, or adaptive selection. Together, the results sharpen a general limitation of wrappers: finite-sample calibration guarantees do not by themselves supply composable semantics for sequential updating or downstream decision-making.
This paper presents a test for wide-sense stationarity (WSS) based on the geometry of the covariance function. We estimate local patches of the covariance surface and then check whether the directional derivative in the $(1,1,0)$ direction is zero on each patch. The method only requires the covariance function to be locally smooth and does not assume stationarity in advance. It can be applied to general stochastic dynamical systems and provides a time-resolved view. We apply the test method to an SDOF system and to a stochastic Duffing oscillator. These examples show that the method is numerically stable and can detect departures from WSS in practice.
Estimating covariance matrices with high-dimensional complex data presents significant challenges, particularly concerning positive definiteness, sparsity, and numerical stability. Existing robust sparse estimators often fail to guarantee positive definiteness in finite samples, while subsequent positive-definite correction can degrade sparsity and lack explicit control over the condition number. To address these limitations, we propose a novel robust and well-conditioned sparse covariance matrix estimator. Our key innovation is the direct incorporation of a condition number constraint within a robust adaptive thresholding framework. This constraint simultaneously ensures positive definiteness, enforces a controllable level of numerical stability, and preserves the desired sparse structure without resorting to post-hoc modifications that compromise sparsity. We formulate the estimation as a convex optimization problem and develop an efficient alternating direction algorithm with guaranteed convergence. Theoretically, we establish that the proposed estimator achieves the minimax optimal convergence rate under the Frobenius norm. Comprehensive simulations and real-data applications demonstrate that our method consistently produces positive definite, well-conditioned, and sparse estimates, and achieves comparable or superior numerical stability to eigenvalue-bound methods while requiring less tuning parameters.
Federated Learning (FL) is a distributed machine learning setting that requires multiple clients to collaborate on training a model while maintaining data privacy. The unaddressed inherent sparsity in data and models often results in overly dense models and poor generalizability under data and client participation heterogeneity. We propose FL with an L0 constraint on the density of non-zero parameters, achieved through a reparameterization using probabilistic gates and their continuous relaxation: originally proposed for sparsity in centralized machine learning. We show that the objective for L0 constrained stochastic minimization naturally arises from an entropy maximization problem of the stochastic gates and propose an algorithm based on federated stochastic gradient descent for distributed learning. We demonstrate that the target density (rho) of parameters can be achieved in FL, under data and client participation heterogeneity, with minimal loss in statistical performance for linear and non-linear models: Linear regression (LR), Logistic regression (LG), Softmax multi-class classification (MC), Multi-label classification with logistic units (MLC), Convolution Neural Network (CNN) for multi-class classification (MC). We compare the results with a magnitude pruning-based thresholding algorithm for sparsity in FL. Experiments on synthetic data with target density down to rho = 0.05 and publicly available RCV1, MNIST, and EMNIST datasets with target density down to rho = 0.005 demonstrate that our approach is communication-efficient and consistently better in statistical performance.
For learned models to be trustworthy, it is essential to verify their robustness to perturbations in the training data. Classical approaches involve uncertainty quantification via confidence intervals and bootstrap methods. In contrast, recent work proposes a more stringent form of robustness: stability to the removal of any subset of $k$ samples from the training set. In this paper, we present a theoretical study of this criterion for ordinary least squares (OLS). Our contributions are as follows: (1) Given $n$ i.i.d. training samples from a general misspecified model, we prove that with high probability, OLS is robust to the removal of any $k \ll n $ samples. (2) For data of dimension $p$, OLS can withstand up to ${k\ll \sqrt{np}/\log n}$ sample removals while remaining robust and achieving the same error rate as OLS applied to the full dataset. Conversely, if $k$ is proportional to $n$, OLS is provably non-robust. (3) We revisit prior analyses that found several econometric datasets to be highly non-robust to sample removals. While this appears to contradict our results in (1), we demonstrate that the sensitivity is due to either heavy-tailed responses or correlated samples. Empirically, this sensitivity is considerably attenuated by classical robust methods, such as linear regression with a Huber loss.
As generative AI becomes increasingly embedded in everyday life, the thoughtful and intentional integration of AI-based tools into statistics education has become essential. We address this need with a focus on homework assignments and we propose the use of LLMs as a companion to complete homework by developing an open-source tool named LLteacher. This LLM-based tool preserves learning processes and it guides students to engage with AI in ways that support their learning, while ensuring alignment with course content and equitable access. We illustrate LLteacher's design and functionality with examples from an undergraduate Statistical Computing course in R, showing how it supports two distinct pedagogical goals: recalling prior knowledge and discovering new concepts. While this is an initial version, LLteacher demonstrates one possible pathway for integrating generative AI into statistics courses, with strong potential for adaptation to other types of classes and assignments.
High-dimensional Bayesian procedures often exhibit behavior that is effectively low dimensional, even when the ambient parameter space is large or infinite-dimensional. This phenomenon underlies the success of shrinkage priors, regularization, and approximate Bayesian methods, yet it is typically described only informally through notions such as sparsity, intrinsic dimension, or degrees of freedom. In this paper we introduce the \emph{Bayesian effective dimension}, a model- and prior-dependent quantity defined through the mutual information between parameters and data. This notion quantifies the expected information gain from prior to posterior and provides a coordinate-free measure of how many directions in parameter space are statistically learnable at a given sample size. In regular parametric models the effective dimension coincides with the usual parameter dimension, while in high-dimensional, ill-posed, or strongly regularized settings it can be substantially smaller. We develop basic properties of the effective dimension and present explicit calculations for Gaussian location models and linear models with general design, revealing close connections with spectral complexity and effective rank. These examples illustrate how shrinkage and regularization mechanisms directly control the growth of effective dimension. The framework offers a unifying perspective on dimension reduction in Bayesian inference and provides insight into uncertainty quantification and the behavior of approximate posterior distributions.
Cold standby 1-out-of-n redundant systems are well-established models in system reliability engineering. To date, reliability analyses of such systems have predominantly assumed exponential, Erlang, or Weibull failure distributions for their components. The Lindley distribution and its generalizations represent a significant class of statistical distributions in reliability engineering. Certain generalized Lindley distributions, due to the appealing characteristics of their hazard functions, can serve as suitable alternatives to other well-known lifetime distributions like the Weibull. This study investigates the reliability of a 1-out-of-n cold standby redundant system with perfect and imperfect switching, assuming that the active component failure times follow the Generalized Lindley distribution. We derive a closed-form expression for the system reliability. To achieve this, the distribution of the sum of n independent and identically distributed random variables following the Generalized Lindley distribution is first determined using the moment-generating function approach.
We consider problems of parameter estimation where design variables can be actively optimized to maximize information gain. To this end, we introduce JADAI, a framework that jointly amortizes Bayesian adaptive design and inference by training a policy, a history network, and an inference network end-to-end. The networks minimize a generic loss that aggregates incremental reductions in posterior error along experimental sequences. Inference networks are instantiated with diffusion-based posterior estimators that can approximate high-dimensional and multimodal posteriors at every experimental step. Across standard adaptive design benchmarks, JADAI achieves superior or competitive performance.
As AI systems are increasingly used to guide decisions, it is essential that they follow ethical principles. A core principle in medicine is non-maleficence, often equated with ``do no harm''. A formal definition of harm based on counterfactual reasoning has been proposed and popularized. This notion of harm has been promoted in simple settings with binary treatments and outcomes. Here, we highlight a problem with this definition in settings involving multiple treatment options. Illustrated by an example with three tuberculosis treatments (say, A, B, and C), we demonstrate that the counterfactual definition of harm can produce intransitive results: B is less harmful than A, C is less harmful than B, yet C is more harmful than A when compared pairwise. This intransitivity poses a challenge as it may lead to practical (clinical) decisions that are difficult to justify or defend. In contrast, an interventionist definition of harm based on expected utility forgoes counterfactual comparisons and ensures transitive treatment rankings.
The Lindley distribution was first introduced by Lindley in 1958 for Bayesian computations. Over the past years, various generalizations of this distribution have been proposed by different authors. The generalized Lindley distributions sometimes have many parameters, and although they show good flexibility, their statistical form becomes complicated. In this article, we propose a new and simple distribution determined by the recursive relation of the Lindley distribution and the Gamma distribution with specific weights. Subsequently, some statistical properties of this distribution are examined, and with real numerical examples, its superiority over the Lindley generalizations is demonstrated.