Statistics

2026-05-27 | Total: 76

#1 Beyond average warming: Two-sample inference for dense-sparse functional data reveals changes in intraday temperature patterns

Authors: Kevin Wilk, Hajo Holzmann

Modern weather stations in Germany record daily temperatures every 10 minutes, whereas measurements from historical reference periods are often only available at much coarser temporal resolutions, typically hourly. This discrepancy must be accounted for when comparing historical and current daily temperature patterns. Motivated by this problem, we develop two-sample inference procedures for functional data under sampling schemes where one sample is densely observed while the other is relatively sparse. Building on recent ideas from transfer learning for functional data, we derive estimators of the difference of the mean functions that attain optimal convergence rates in the supremum norm. We further establish a functional central limit theorem in the space of continuous functions and develop multiplier bootstrap methods for constructing uniform confidence bands. Extensions to functional time series are also discussed. Applying the proposed methodology to daily temperature curves from German weather stations, analyzed separately by month, reveals that climate change has altered not only average temperatures but also intraday temperature patterns. In particular, for stations such as Berlin, warming from morning to early afternoon exceeds the daily average increase, whereas evening and nighttime temperatures exhibit comparatively smaller increases.

Subject: Applications

Publish: 2026-05-26 17:42:48 UTC


#2 Two-Phase Sampling Designs and Analysis Approaches for Ordinal Outcomes

Authors: Yunbi Nam, Nathan I. Shapiro, Eric P. Schmidt, Wesley H. Self, Ran Tao, Jonathan S. Schildcrout

Modern clinical trials and cohort studies gather low-cost data on all participants but may have limited resources to assess expensive exposures such as biomarkers or genomic data. When interest lies in associations involving expensive exposures, two-phase designs provide a cost-effective framework by using information available on all participants to guide the targeted selection of a subset for additional measurements. We extend this framework to studies with ordinal outcomes, a common yet previously unexplored setting. We propose three outcome-informed phase 2 sampling designs -- outcome-dependent sampling (ODS), covariate-stratified ODS, and residual-dependent sampling -- that leverage phase 1 data to enrich phase 2 selection with informative subjects. We then develop analysis methods for valid and efficient estimation/inference, including conditional likelihood methods with ascertainment-corrected maximum likelihood estimation, multiple imputation, and a full likelihood method using sieve maximum likelihood estimation. Across a range of scenarios, simulation studies show that the proposed methods substantially improve efficiency over simple random sampling with standard maximum likelihood estimation. We further demonstrate their practical utility by examining the association between interleukin-6 and a four-level clinical status outcome -- discharged, hospitalized but not in the ICU, hospitalized in the ICU, and death -- 14 days after randomization into the Crystalloid Liberal or Vasopressors Early Resuscitation in Sepsis trial.

Subject: Methodology

Publish: 2026-05-26 17:38:19 UTC


#3 Causally-interpretable meta-analysis using aggregate data

Authors: Qingyang Shi, Wouter van Amsterdam, Sacha la Bastide-van Gemert, Talitha Feenstra, Issa J. Dahabreh

Evidence syntheses and meta-analyses are used to inform clinical practice guidelines and health economic evaluations. However, heterogeneity of treatment effects poses a significant challenge. Conventional meta-analysis addresses heterogeneity through random-effects assumptions, which are not supported by design and lead to estimates that may not apply to any real-world population. Causally-interpretable meta-analysis (CIMA) offers a rigorous framework for specification, identification, and estimation of causal effects when combining information from multiple randomized trials. Initial development of CIMA focused on using individual data from randomized trials, but such data are often unavailable in practice. Here, we propose a new version of CIMA that requires only aggregate data from trials, addressing the limitations of traditional meta-analysis methods without needing individual data. The method leverages the trials' reported estimates of marginal and one-at-a-time subgroup treatment effects and descriptive statistics for baseline covariates to build moment equations for identifying and estimating a parametric conditional average treatment effect (CATE) function. The average treatment effect in a new target population is obtained by marginalizing the CATE function over the individual covariate data that defines the target population. The method can also be used to obtain causally-interpretable indirect treatment comparisons in the target population. We establish the asymptotic properties of the method, assess its finite-sample performance in simulation studies, and illustrate the application of the method by re-analyzing a published meta-analysis for SGLT2 inhibitors in patients with heart failure.

Subject: Methodology

Publish: 2026-05-26 16:48:01 UTC


#4 Inverse Control Constrained Optimization of Vessel Speed Decisions Under Environmental Risk: Evidence from Arctic Shipping

Authors: Mauli Pant, Linda Fernandez, Indranil Sahoo

Understanding how decision makers balance operational efficiency with environmental and ecological risks is central to vessel navigation. We model vessel speed as a control variable in a constrained optimization framework in which vessel operators balance multiple competing objectives, including transit efficiency, ice-related navigational risk, and whale-related ecological risk. The underlying risk parameters are estimated using over 14 million Automatic Identification System (AIS) observations from the United States Arctic (2010-2019), together with environmental covariates and spatially explicit whale density estimates. The framework incorporates a nonlinear risk objective, vessel heterogeneity, and regularization to ensure stable and interpretable results. The inferred trade-offs reveal distinct decision-making patterns across vessel groups and navigational statuses. Vessel types such as Tug Tow and Cargo balance operational speed with environmental and ecological considerations. In contrast, several vessel groups, including Fishing, Passenger, and Unspecified vessels, are strongly influenced by ice-related risk, while Pleasure Craft and Tankers exhibit higher sensitivity to whale-related risk. Across navigational status categories, similar heterogeneity is observed. The dominant status, under way using engine, displays a clear trade-off, whereas other statuses, such as aground and undefined, are strongly shaped by ice-related constraints. Statuses including restricted maneuverability and engaged in fishing exhibit higher estimated sensitivity to whale-related risk, though with substantial uncertainty. Sensitivity analysis indicates that increasing whale-related risk weighting produces limited changes in model-implied optimal speed, whereas increasing ice-related risk leads to more consistent reductions.

Subjects: Applications, Machine Learning

Publish: 2026-05-26 16:46:27 UTC


#5 An Entropy-Energy Identity for Predictive Kullback-Leibler Regret in Infinitely Divisible Location Models

Authors: Kōsaku Takanashi, Kenichiro McAlinn

We consider predictive density estimation under logarithmic score for $d$-dimensional infinitely divisible location models. Taking the formal Bayes predictive density under the Lebesgue prior as a benchmark, we study the Kullback-Leibler regret of competing Bayes predictive densities. Our main contribution is an exact entropy-energy identity: the integrated regret of a Bayes predictive density $\hat{p}^π$ under prior $π$ relative to the benchmark admits an exact representation as the Dirichlet-form energy of the square-rooted marginal distribution $\sqrt{M^π}$ for the symmetric Markov semigroup induced by the benchmark kernel. This converts regret comparisons into a potential-theoretic problem and yields a sharp recurrence/transience characterization of when the benchmark predictive density can or cannot be uniformly improved. We introduce an $\mathcal{A}$-harmonic class of improper priors -- defined through the generator $\mathcal{A}$ of the induced process -- and give explicit tail conditions -- an integral test on the induced marginal, equivalent to power-law prior decay in heavy-tailed models -- that guarantee admissibility of the resulting Bayes predictive density. We illustrate the theory with new results for several distributions.

Subjects: Statistics Theory, Probability

Publish: 2026-05-26 16:30:15 UTC


#6 Space-filling foldover designs for order-of-addition experiments under Kendall tau distance criteria

Authors: Hui Shao, Yaping Wang, Qian Xiao

Order-of-addition experiments arise when the response depends on the order in which a set of components is added. Since the number of possible orders increases factorially with the number of components, full permutation designs are rarely feasible except for small problems. This paper studies space-filling fractional designs for order-of-addition experiments based on the Kendall tau distance, a natural metric for comparing permutations through pairwise ordering disagreements. We consider the maximin Kendall tau distance criterion and related dispersion criteria, and establish their connections with statistical optimality under the pairwise ordering model and a Gaussian process model with the Mallows kernel. To construct such designs, we propose an efficient foldover simulated annealing algorithm, denoted by FSA-KD, based on swap moves in the permutation space, together with foldover and incremental updating strategies. Numerical studies show that the resulting FSA-KD designs have large minimum pairwise Kendall tau distances, denoted by k_min(D), and stable pairwise distance distributions, and perform well in surrogate modeling and permutation-based optimization tasks.

Subject: Methodology

Publish: 2026-05-26 16:26:25 UTC
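
The Kendall tau distance and the maximin criterion named in the abstract of #6 can be illustrated with a minimal sketch. It is not the authors' FSA-KD algorithm: the function names are hypothetical, and treating a foldover as "append the reversed order of every run" is an assumption made here for illustration only.

```python
from itertools import combinations


def kendall_tau_distance(p, q):
    """Count the component pairs that p and q place in opposite order."""
    pos_p = {c: i for i, c in enumerate(p)}
    pos_q = {c: i for i, c in enumerate(q)}
    return sum(
        1
        for a, b in combinations(p, 2)
        if (pos_p[a] - pos_p[b]) * (pos_q[a] - pos_q[b]) < 0
    )


def min_pairwise_distance(design):
    """Maximin criterion value: the smallest distance between any two runs."""
    return min(kendall_tau_distance(p, q) for p, q in combinations(design, 2))


def foldover(design):
    """Assumed foldover: add the reversed addition order of every run."""
    return design + [tuple(reversed(run)) for run in design]


# Two runs on four components; each tuple is an order of addition.
design = [(0, 1, 2, 3), (1, 0, 3, 2)]
print(kendall_tau_distance(design[0], design[1]))  # 2
print(min_pairwise_distance(foldover(design)))
```

A design search would try to make `min_pairwise_distance` as large as possible for a fixed number of runs; the quadratic pair loop here is only suitable for small designs.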


#7 Posterior Quantification of Borrowing from Multiple Historical Control Data in Bayesian Dynamic Borrowing Methods: A Scoping Review

Authors: Tomohiro Ohigashi, Wataru Murasaki, Masahiko Gosho

Bayesian dynamic borrowing methods incorporate historical control data into current clinical trial analyses while allowing the degree of borrowing to depend on the compatibility between historical and current data. Although many methods have been proposed, the degree of borrowing is often difficult to interpret, especially when multiple historical control sources are available. This scoping review focuses on posterior quantification of borrowing from multiple historical controls. We discuss overall borrowing summaries based on effective historical sample size, together with method-specific source-level summaries of borrowing, information contribution, or compatibility arising from power priors, unit information priors, multisource exchangeability models, Dirichlet process mixture models, and potential bias models. We distinguish posterior borrowing measures from quantities describing prior information allocation or source-specific conflict. Two case studies, one with a binary endpoint and one with a continuous endpoint, illustrate that methods with broadly similar posterior treatment effect estimates may differ in both the overall amount and source-specific pattern of borrowing. These examples show that large overall borrowing may reflect selective borrowing from compatible historical sources rather than uniform borrowing from all sources. We recommend reporting treatment effect estimates together with overall and source-specific borrowing summaries, when available, to improve transparency in posterior inference.

Subjects: Methodology, Applications

Publish: 2026-05-26 15:36:28 UTC


#8 Convergence Rates of Ordering, Testing and Estimation Procedures for Graphons With Fast Boundary Decay Rates

Authors: Jeannette Janssen, Na Lin, Aaron Smith

In latent-position random graph models (LPMs), latent vertex positions $U_{1},\ldots,U_{n}$ are sampled from some distribution on a latent space $Ω$, then edges of an observed graph $G = ([n],E)$ are sampled with some probability $\mathbb{P}[(i,j) \in E ]=w(U_i,U_j)$ that depends on the unobserved latent positions. LPMs are ubiquitous in the statistical analysis of networks, offering models that have good empirical performance, strong theoretical guarantees, and tractable algorithms. The special case $Ω= [0,1]$ is important, as it corresponds to graphs with temporal or preference-based structure. In this paper, we study three problems related to LPMs with latent space $[0,1]$: \textit{ordering} the vertices according to the latent positions, \textit{estimating} the generating graphon $w$, and \textit{testing} whether an observed graph $G$ could have come from an LPM with state space $[0,1]$. Our results on the ordering problem greatly generalize two observations of Janssen/Smith (2022): (i) for \textit{some} families of graphons, the best estimate of the ordering converges much faster than the usual statistical rate of $\frac{1}{\sqrt{n}}$, and (ii) this occurs even though, for the same families of graphons, the best estimate of the latent positions still converges at the usual $\frac{1}{\sqrt{n}}$ rate. As a main consequence, we develop a computationally-efficient graphon-estimation algorithm and show that it has the same convergence rate as the non-explicit optimal algorithm of Gao et al. (2015). We also derive and analyze a testing procedure.

Subjects: Statistics Theory, Probability

Publish: 2026-05-26 15:28:05 UTC
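
The sampling mechanism described in the abstract of #8 (latent positions on $[0,1]$, then independent edges with probability $w(U_i,U_j)$) can be sketched as follows. The uniform latent distribution and the particular graphon `example_graphon` are assumptions chosen for illustration; the paper's graphon families are not specified in this listing.

```python
import random


def sample_lpm_graph(n, w, seed=0):
    """Draw latent positions U_1..U_n uniformly on [0, 1), then sample each
    edge {i, j} independently with probability w(U_i, U_j)."""
    rng = random.Random(seed)
    u = [rng.random() for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < w(u[i], u[j]):
                edges.add((i, j))
    return u, edges


def example_graphon(x, y):
    """Hypothetical graphon: edge probability decays with latent distance."""
    return 1.0 - abs(x - y)


u, edges = sample_lpm_graph(30, example_graphon)
print(len(edges), "edges among", len(u), "vertices")
```

In the ordering problem only `edges` would be observed, and the goal would be to recover the ranking of `u`.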


#9 Bernstein-von Mises Theorem for Sparse Generalized Linear Model

Authors: Hanqing Li, Xuewen Lu

We study spike-and-slab priors for generalized linear models with possible grouped sparsity. The main result is an oracle Bernstein--von Mises theorem for the fractional posterior under supportwise likelihood assumptions. The proof develops sparse local asymptotic normality and Laplace approximation around support-specific pseudo-true centers, and combines them with fixed-prior mass, support penalization, recovery geometry, and beta-min separation to obtain contraction, support recovery, Gaussian mixture approximation, and collapse to the oracle Gaussian law. Model-entry verifications are given for Gaussian regression and for logistic, Poisson, probit, Gamma log-link, and negative-binomial log-link regression under stated sufficient conditions. The ordinary posterior is treated only through restricted Gaussian and canonical-link extensions, with coverage under additional active-dimension and moment conditions.

Subjects: Statistics Theory, Methodology

Publish: 2026-05-26 15:06:11 UTC


#10 Copula and spatial-regularized variational autoencoder for mapping disease comorbidity in West Africa

Authors: Osafu Augustine Egbon, Bassey David Ita, Faith Eshofonie, Ezra Gayawan

Geospatial health disproportionality remains a critical public health concern, as communities face heterogeneous illness risks due to varying exposures to adverse socioeconomic and environmental conditions. While statistical models have been adopted to identify risk factors, studies that account for the complex, non-linear dependencies and spatial regularities inherent in comorbid disease patterns are underdeveloped. In this work, we propose a novel spatially regularized variational autoencoder (VAE) to characterize and map the geospatial disproportion of childhood comorbidity in West Africa, focusing on diarrhea, fever, and acute respiratory infection (ARI). To model dependence between these conditions, this study integrates a bivariate Gumbel copula into the VAE framework, enabling flexible modeling of asymmetric dependence and quantification of joint and conditional morbidity risks. Additionally, covariate effects within the framework were quantified to facilitate epidemiological interpretation of risk factors. The proposed method was benchmarked against commonly used methods and applied to characterize comorbidity in West Africa using the Demographic and Health Survey data. Findings reveal pronounced spatial heterogeneity in the likelihood of comorbidity among West African children, with the strongest co-occurrence observed between fever and ARI. Household wealth, maternal education, and access to improved water sources were associated with the likelihood of comorbidity. These patterns highlight high-risk areas and underscore the need for targeted, location-specific public health interventions.

Subject: Methodology

Publish: 2026-05-26 14:55:51 UTC


#11 Gaussian Process-based learning with new MCMC-based implementation of Wishart prior on correlation matrix

Authors: Kane Warrior, Dalia Chakrabarty

In probabilistic supervised learning of an input-output relationship, modelled as a sample function of a Gaussian Process (GP), priors are typically specified for the hyperparameters of the kernel that parametrises the covariance function of the GP, where the induced covariance matrix of the (resulting multivariate Normal) likelihood governs the learning and prediction. When the sought function is highly multivariate, multiple lengthscale parameters must be learnt simultaneously, making inference difficult. We develop a "self-assembled" Wishart prior for the covariance matrix, while undertaking Bayesian inference on the kernel hyperparameters using MCMC. The construction uses a look-back window over recent MCMC iterations to define a time-step dependent scale matrix, thereby introducing adaptiveness to the chain. Results suggest that direct prior specification on the covariance matrix can be useful for diagnosing weakly informative inputs within the GP-based learning paradigm. We support our prior development with two distinct empirical illustrations: one on synthetic data and another on a real-world dataset.

Subjects: Machine Learning, Machine Learning

Publish: 2026-05-26 14:37:18 UTC


#12 Estimation and Inference for Win Measures with Multiple Ordinal Endpoints Subject to Missingness

Authors: Yi Liu, Huiman Barnhart, Sean O'Brien, Yuliya Lokhnygina, Roland A. Matsouaka

Win measures, including the win ratio (WR), win odds (WO), net benefit (NB), and desirability of outcome ranking (DOOR), are increasingly used in randomized clinical trials with multiple hierarchical ordinal endpoints. In practice, however, one or more component endpoints may have missing data. The standard pairwise-comparison approach, which treats pairs with missing outcomes as ties, can produce biased estimates, even if the data are missing completely at random (MCAR). Although inverse probability of censoring weighting (IPCW) methods have been developed for censored survival endpoints, corresponding methods for addressing missing hierarchical ordinal endpoints are not yet available. To address this gap, we develop inverse probability weighting (IPW) and augmented IPW (AIPW) estimators for win measures with hierarchical ordinal endpoints subject to missing data, allowing missingness to depend on treatment assignment and baseline covariates. The IPW estimator corrects bias by reweighting complete observed outcomes using joint non-missingness probabilities involved in estimating the joint cell probabilities that define the win measures. The AIPW estimator additionally incorporates outcome modeling, improving efficiency and achieving double robustness. For inference, we derive closed-form variance estimators for both methods based on influence functions. Simulation studies show that the standard approach can be substantially biased, whereas the proposed IPW and AIPW estimators remain consistent with near-nominal coverage. Furthermore, the AIPW estimator is generally more efficient than the IPW estimator. Applications to the SCOUT-CAP and ACTT-1 trials illustrate the practical utility of the proposed methods. An R package, WinMO, is provided for implementation.

Subjects: Methodology, Statistics Theory, Applications

Publish: 2026-05-26 14:34:45 UTC
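
For readers unfamiliar with the win measures in #12, the standard pairwise-comparison definitions can be sketched on a single, fully observed ordinal endpoint. This is only the baseline the abstract starts from, under the assumption that higher scores are better; it does not implement the hierarchical comparison, the missing-data weighting, or the WinMO package, and the function name `win_measures` is hypothetical.

```python
def win_measures(treatment, control):
    """Compare every treatment-control pair on one ordinal outcome,
    assuming a higher score means a better outcome."""
    wins = losses = ties = 0
    for t in treatment:
        for c in control:
            if t > c:
                wins += 1
            elif t < c:
                losses += 1
            else:
                ties += 1
    total = wins + losses + ties
    return {
        # Win ratio ignores ties; undefined when there are no losses.
        "win_ratio": wins / losses,
        # Win odds split each tie evenly between the two arms.
        "win_odds": (wins + 0.5 * ties) / (losses + 0.5 * ties),
        # Net benefit is the difference in win and loss proportions.
        "net_benefit": (wins - losses) / total,
    }


# 4 x 4 = 16 pairs: 11 wins, 1 loss, 4 ties.
result = win_measures([3, 2, 2, 1], [2, 1, 1, 0])
print(result["win_ratio"])    # 11.0
print(result["net_benefit"])  # 0.625
```

Treating a pair with a missing outcome as a tie would move it out of `wins` or `losses` and into `ties`, which is the mechanism the abstract identifies as a source of bias.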


#13 Causal Representation Learning for Generalisable Recommendation

Authors: Yorgos Felekis, Michael O'Riordan, Oriol Corcoll, Ciarán M. Gilligan-Lee

Predictive models trained on observational data often fail to generalise to the distributions they encounter when deployed, especially when the training data is a product of the system being optimised. Recommender systems are a canonical example: they are trained on interaction logs confounded by the deployed policy, past user behaviour, and platform filtering. As a result, the training distribution differs substantially from the candidate distribution scored at serving time, a gap that makes offline metrics unreliable predictors of online performance. We address the distribution shift problem with a method motivated by causal representation learning (CRL). We propose an information-theoretic disentanglement criterion and prove that its optimum depends only on the causal components of the input. We then derive a tractable variational lower bound that makes the criterion optimisable from finite observational data alone. The scope of our method is narrower than that of much of the CRL literature, in that we target better generalisation under distribution shift, not full identification of all latent causal factors. This narrower target is what makes the method practical, requiring only the existing confounded logs, applying to any standard supervised model, and adding no inference-time cost. Our headline evaluation is an A/B test with millions of users on Spotify, applied to a production ranker for personalised playlist generation. A capacity-matched CRL variant performed on par offline but delivered substantial online gains in listener engagement. Complementary evidence on the public KuaiRand recommendation dataset and a synthetic benchmark with known causal structure shows the same pattern: offline parity with baseline, gains under distribution shift. Across all three settings, adding our causal disentanglement objective yields meaningfully better out-of-distribution generalisation.

Subjects: Machine Learning, Machine Learning, Methodology

Publish: 2026-05-26 13:58:36 UTC


#14 Conformalized Large-Scale Selective Inference with Informative and Trustworthy Prediction Sets

Authors: Wangcheng Li, Guanlan Zhao, Xu Guo, Wenguang Sun

In large-scale prediction problems, exhaustively following up on all test units is often impractical and inefficient, motivating a selective reporting strategy that fulfills the dual requirements of informativeness and trustworthiness. Within the InfoFCR (Informative prediction with False Coverage Rate control) framework, we propose SCIP (Selective Conformal Inference for Informative Predictions), a procedure built on three key components: (i) an informative set constructor that tailors prediction sets to individual test units according to user-specified informativeness constraints; (ii) a trust score that provides a principled quantification of the trustworthiness of candidate informative sets; and (iii) generalized conformal p-values that are used to perform FCR analysis for selecting the most promising candidates. We establish that SCIP guarantees finite-sample FCR control and is asymptotically anti-conservative, achieving higher statistical power than existing methods. The framework is highly versatile, accommodating a wide range of error metrics across both regression and classification tasks. Extensive numerical experiments on simulated and real data demonstrate the effectiveness of our approach.

Subjects: Statistics Theory, Methodology

Publish: 2026-05-26 13:29:40 UTC
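
As background for the conformal p-values mentioned in #14, the standard split-conformal p-value can be sketched as below. The generalized p-values and the SCIP selection procedure are not described in enough detail in this listing to reproduce, so this shows only the textbook construction; the function name and the nonconformity scores are made-up illustrations.

```python
def conformal_p_value(calibration_scores, test_score):
    """Standard split-conformal p-value: the rank of the test score among
    the calibration scores, with the +1 finite-sample correction."""
    n = len(calibration_scores)
    at_least_as_extreme = sum(1 for s in calibration_scores if s >= test_score)
    return (1 + at_least_as_extreme) / (n + 1)


# Hypothetical nonconformity scores (larger means less conforming).
calibration = [0.1, 0.4, 0.2, 0.9]
print(conformal_p_value(calibration, 0.5))  # 0.4
```

Under exchangeability of calibration and test scores, this p-value is super-uniform, which is what makes it usable inside multiple-testing procedures.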


#15 Constrained Bayesian Experimental Design via Online Planning

Authors: Yujia Guo, Daolang Huang, Xinyu Zhang, Sammie Katt, Samuel Kaski, Ayush Bharti

Bayesian experimental design (BED) is a principled framework for data-efficient design of sequential experiments. However, existing BED methods are unable to adapt to dynamic constraints inherent in real-world tasks due to budget limitations, varying costs, or physical constraints that restrict how designs evolve over time. In this paper, we introduce a novel approach to BED that enables constrained optimization of experimental designs by combining offline pre-training of an amortized policy and a posterior network with online multi-step lookahead planning using scenario trees. We empirically demonstrate that our method yields substantially more informative design sequences than existing methods across a range of constrained BED tasks, while incurring only a modest additional computational overhead.

Subjects: Machine Learning, Machine Learning

Publish: 2026-05-26 13:13:28 UTC


#16 Signal-to-Noise Ratio and Sample Size Govern Representational Alignment in Neural Networks

Authors: Ali Hussaini Umar, Alessandro Laio

Neural networks are known to develop latent representations that are \textit{aligned}, namely structurally similar across networks trained with different architectures, training protocols, or training datasets. We study this phenomenon in a controlled setting, where we train an ensemble of networks on regression and classification tasks using training sets perturbed by independent realizations of a noise process. We show that the signal-to-noise ratio (SNR) and the training sample size influence the alignment in qualitatively similar ways in networks trained on real-world datasets and in an extremely simple \textit{linear} network with a single hidden layer, for which the alignment can be estimated analytically. Across linear and nonlinear networks, regression and classification tasks, and both synthetic and real-world data, we consistently observe that alignment varies monotonically with SNR but non-monotonically with training sample size. In particular, the alignment is minimized near the interpolation threshold, and a stronger alignment does not necessarily correspond to better generalization error. These findings reveal a non-trivial dependence of alignment on data quality and quantity, decoupled from generalization performance.

Subjects: Machine Learning, Disordered Systems and Neural Networks, Machine Learning, Neural and Evolutionary Computing, Neurons and Cognition

Publish: 2026-05-26 12:58:48 UTC


#17 Semiparametric Inference for Causal Effects on Functional Outcomes

Authors: Junzhu Nie, Chengxiu Ling, Mengfei Ran

Difference-in-differences (DiD) is a cornerstone of causal inference, yet extending it to functional outcomes is not a routine scalar generalization; rather, it entails three fundamental challenges in identification, inference, and observation. This paper develops a comprehensive semiparametric inference framework for functional DiD with discretely observed data. First, we define the functional average treatment effect under parallel trends and derive its efficient influence function (EIF), thereby establishing the semiparametric efficiency bound. Second, leveraging Neyman orthogonality and cross-fitting, we construct a debiased estimator that effectively mitigates regularization bias arising from nonparametric reconstruction. Third, we establish weak convergence of the estimator and propose an asymptotically valid uniform confidence band, enabling a rigorous transition from pointwise to curve-level inference. Finally, we demonstrate that reconstruction error under discrete sampling is asymptotically negligible for semiparametric inference, ensuring practical feasibility. Simulations and empirical applications confirm that the proposed method achieves superior coverage and testing power in finite samples, providing a theoretically grounded and computationally tractable foundation for causal evaluation with functional data.

Subjects: Methodology, Applications

Publish: 2026-05-26 12:52:11 UTC


#18 INARMA Models for Count Random Fields -- a Survey

Authors: Angelika Silbernagel, Christian H. Weiß

The thinning-based integer-valued autoregressive moving-average (INARMA) models are popular for count time series. Recently, types of INARMA models have also been developed for count random fields, i.e., for spatial count data located on a regular two-dimensional grid. This article provides a comprehensive survey on existing INARMA random fields, covering approaches with different thinning operators, first- and higher-order models, as well as unilateral and multilateral model structures.

Subject: Statistics Theory

Publish: 2026-05-26 11:49:15 UTC
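
The thinning operation underlying the INARMA models in #18 can be illustrated with the simplest time-series case, a Poisson INAR(1) process with binomial thinning. This is a sketch of the one-dimensional building block only, not of any random-field model from the survey; the function names and parameter values are illustrative assumptions.

```python
import math
import random


def binomial_thinning(alpha, x, rng):
    """alpha o x: each of the x counts survives independently with prob. alpha."""
    return sum(1 for _ in range(x) if rng.random() < alpha)


def poisson(lam, rng):
    """Poisson draw by multiplying uniforms (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_inar1(n, alpha, lam, seed=0):
    """X_t = alpha o X_{t-1} + eps_t with eps_t ~ Poisson(lam)."""
    rng = random.Random(seed)
    # Start from the stationary marginal, Poisson(lam / (1 - alpha)).
    x = poisson(lam / (1.0 - alpha), rng)
    path = []
    for _ in range(n):
        x = binomial_thinning(alpha, x, rng) + poisson(lam, rng)
        path.append(x)
    return path


path = simulate_inar1(200, alpha=0.5, lam=1.0)
print(sum(path) / len(path))  # sample mean; the stationary mean is lam / (1 - alpha)
```

Thinning replaces the scalar multiplication of an ordinary AR(1) model so that every value stays a non-negative integer. Random-field versions apply the same idea to neighbours on a two-dimensional grid.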


#19 Robust ensemble Kalman filtering under observation noise misspecification via diffusion score matching

Authors: Hans Reimann, Sebastian Reich

We address the problem of observation noise misspecification in Bayesian filtering of dynamical systems via recent advances in generalised Bayesian inference. Mismatch in tail decay between the true data generating process and an assumed observation model, often showing via frequent outliers, can strongly impact Bayesian updates and analysis in Kalman filtering. Existing approaches often employ detect-and-delete schemes or covariance inflation to avoid assimilation of influential instances of misspecification. In challenging settings where the analysis updates are barely sufficient to counteract the induced forecast uncertainty, these strategies may destabilize or struggle to provide reliable uncertainty quantification. We consider a novel Kalman filter adjusting information processing in the analysis step by employing diffusion score matching for inference to obtain robustness while maintaining well-quantified uncertainties. We provide theoretical properties of the diffusion score matching Kalman filter in linear Gaussian state space systems covering conjugacy and closed form parameter update in the analysis step, robustness, covariance stability, and tuning as well as high-dimensional consistency. We derive ensemble approximations via stochastic and deterministic coupling and implement localization to obtain EnKF, ESRF and LETKF varieties. We evaluate the methods in appropriate simulation studies on target-tracking, the chaotic Lorenz 63 system and the Lorenz 96 system in 40 dimensions. Our insights highlight a critical trade-off between robustness and stability in Bayesian filtering. Methods employing generalized Bayesian inference can navigate this balance and improve data assimilation in challenging environments combining non-linear dynamics and potentially non-Gaussian observation noise.

Subjects: Statistics Theory, Dynamical Systems, Methodology

Publish: 2026-05-26 11:42:55 UTC


#20 A warning system for risk prediction of metabolic syndrome in a healthy population of blood donors

Authors: Simone Colombara, Ilenia Epifani, Alessandra Guglielmi, Ettore Lanzarone

Metabolic syndrome is a complex clinical condition characterized by the simultaneous presence of multiple metabolic risk factors and represents a major public health concern. The syndrome develops silently and may remain undiagnosed for long periods, highlighting the importance of investigating early metabolic alterations before overt disease onset. Longitudinal monitoring of predominantly healthy individuals may help identify metabolic risk early. The paper proposes a Bayesian statistical model to estimate the probability of metabolic syndrome among blood donors during pre-donation screening, incorporating information collected at previous visits. Using longitudinal data from one of the main blood donor associations in Italy, AVIS Milan, we analyze repeated clinical and lifestyle measurements from a predominantly healthy population of donors. In particular, we fit a Bayesian multivariate model that jointly represents the logarithm of the five diagnostic components of metabolic syndrome. The model accounts for within-donor dependence across repeated visits and provides probabilistic estimates of individual risk. Our framework aims to provide clinicians at AVIS Milan with an interpretable traffic-light warning system (low, intermediate, high risk) during pre-donation screening to facilitate the identification of individuals at risk of metabolic syndrome at future visits and to support targeted preventive interventions during routine donor assessment, ultimately contributing to a long-term reduction in healthcare costs for the Italian national healthcare system.

Subjects: Applications, Methodology

Publish: 2026-05-26 10:56:30 UTC


#21 Accelerated Schrödinger-Föllmer samplers

Authors: Haotian Lin, Xiaojie Wang, Xiaoyan Zhang

Sampling is a fundamental algorithmic task in wide-ranging applications across multiple disciplines such as scientific computing, statistics and machine learning. In this paper, an efficient stochastic Runge-Kutta scheme is proposed to accelerate the Schrödinger-Föllmer sampler, designed for sampling from complex and high-dimensional multimodal distributions. The resulting stochastic Runge-Kutta Schrödinger-Föllmer sampler (SRKSFS) is proved to achieve a convergence rate of order $\mathcal{O} ( h^{3/2} |\ln h|)$ in the $L^2$-Wasserstein distance, considerably improving the order $\mathcal{O}(h)$ of the existing Euler-type sampler. Obtaining the enhanced convergence rate is, however, not trivial, because the drift of the diffusion process is not differentiable but only $\frac{1}{2}$-Hölder continuous with respect to the time variable. To address the difficulty, we rely on delicate error estimates to overcome the singularity due to time derivatives of the drift, at the expense of the logarithmic factor. Furthermore, the framework is extended to data-driven Schrödinger-Föllmer generation with empirical measures, enabling data-driven sampling without known density. A variety of numerical experiments are reported to validate the effectiveness of the proposed sampling algorithms.
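For orientation, here is a minimal sketch of the Euler-type baseline that the abstract compares against, not the stochastic Runge-Kutta scheme itself. It assumes the usual form of the Schrödinger-Föllmer diffusion, $dX_t = b(X_t, t)\,dt + dB_t$ on $[0,1]$ with $X_0 = 0$ and drift $b(x,t) = \nabla \log \mathbb{E}[f(x + \sqrt{1-t}\,Z)]$, where $f$ is the density ratio of the target to the standard Gaussian; the abstract does not spell this out. The target is a one-dimensional Gaussian $N(m, \sigma^2)$, chosen only because the drift then has a closed form; all function names are hypothetical.

```python
import numpy as np

def sf_drift_gaussian(x, t, m, sigma):
    """Closed-form drift for the 1D target N(m, sigma^2).

    With a = 1/sigma^2 - 1 and c = m/sigma^2, the Gaussian integral gives
    b(x, t) = (c - a x) / (1 + a (1 - t)).
    """
    a = 1.0 / sigma**2 - 1.0
    c = m / sigma**2
    return (c - a * x) / (1.0 + a * (1.0 - t))

def euler_sf_sampler(n_samples, n_steps, m, sigma, rng):
    """Euler-Maruyama discretisation of the diffusion on [0, 1], started at 0."""
    h = 1.0 / n_steps
    x = np.zeros(n_samples)
    for k in range(n_steps):
        t = k * h
        noise = rng.standard_normal(n_samples)
        x = x + h * sf_drift_gaussian(x, t, m, sigma) + np.sqrt(h) * noise
    return x  # approximate draws from the target at time 1

rng = np.random.default_rng(0)
samples = euler_sf_sampler(n_samples=20_000, n_steps=200, m=2.0, sigma=0.5, rng=rng)
print("sample mean:", samples.mean(), "sample std:", samples.std())
```

For targets without a closed-form drift, the expectation in $b$ would have to be approximated, for example by Monte Carlo over $Z$; that step is omitted here.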

Subject: Statistics Theory

Publish: 2026-05-26 10:14:27 UTC


#22 Estimating the logistic regression equation when the model is incorrect

Author: Nils Lid Hjort

Protesting mildly against the notion of an exactly correct parametric model, the view is adopted that the logistic regression equation is merely an approximation to the underlying, true function. The behaviour of likelihood-based estimators is investigated in such a general framework. The maximum likelihood estimator is shown to be consistent for a certain least false parameter value minimising a weighted average of quantities that measure the distance from the true to the parametric model. Asymptotic normality is also demonstrated. Finally, a number of additional remarks are offered, some pointing to natural generalisations and some to new questions for research, like weighted and local likelihood estimation methods.
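The consistency claim can be illustrated numerically. The sketch below assumes a made-up true success probability with a quadratic term in the logit, a covariate uniform on a finite grid, and a working model that is linear in the logit. The least false parameter is taken to be the minimiser of the covariate-averaged Kullback-Leibler divergence, which solves the population score equation $\mathbb{E}[(p(x) - q_\beta(x))\,x] = 0$; this specific setup is an assumption for illustration, not the paper's example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, target, weights, iters=100):
    """Newton-Raphson for the weighted score equation sum_i w_i (target_i - q_i) x_i = 0.

    With binary targets this is the logistic MLE; with target = true probability
    and weights = covariate distribution it yields the least false parameter.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        q = sigmoid(X @ beta)
        score = X.T @ (weights * (target - q))
        info = (X * (weights * q * (1.0 - q))[:, None]).T @ X
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-12:
            break
    return beta

def true_prob(x):
    # Hypothetical truth: not linear in the logit, so the working model is wrong.
    return sigmoid(0.5 + x - 0.8 * x**2)

# Population problem: covariate uniform on a grid, target is the true probability.
grid = np.linspace(-2.0, 2.0, 201)
X_grid = np.column_stack([np.ones_like(grid), grid])
beta_star = fit_logistic(X_grid, true_prob(grid), np.full(grid.size, 1.0 / grid.size))

# Sample problem: ordinary maximum likelihood on simulated binary outcomes.
rng = np.random.default_rng(0)
n = 200_000
x = rng.choice(grid, size=n)
y = (rng.random(n) < true_prob(x)).astype(float)
X = np.column_stack([np.ones(n), x])
beta_hat = fit_logistic(X, y, np.full(n, 1.0 / n))

print("least false parameter:", beta_star)
print("MLE on simulated data:", beta_hat)
```

With a large sample the two coefficient vectors should agree closely, even though neither equals the intercept and slope of the true logit.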

Subject: Statistics Theory

Publish: 2026-05-26 09:26:07 UTC


#23 Marginal likelihoods for finite-support Huber contamination

Author: Jaehoan Kim

For Huber contamination on a known finite sample space, the unrestricted contaminating law is a probability vector on the support atoms, and domination over all measurable subsets reduces to atomwise inequalities. Placing a Dirichlet prior on this probability vector and a Beta prior on the contamination proportion gives an exact marginal likelihood for the structural parameter after analytic integration of both nuisance quantities. The likelihood is a finite weighted sum over allocations of the observed counts between the structural and contaminating components. For fixed support size, this sum and its score can be evaluated by a dynamic program with quadratic cost in the sample size, enabling gradient-based posterior sampling.

Subjects: Methodology, Statistics Theory, Computation

Publish: 2026-05-26 09:03:09 UTC


#24 Transformers Can Learn Posterior Predictive Distributions In-Context

Authors: Gyeonghun Kang, Changwoo J. Lee, Xiang Cheng

Prior-data fitted networks (PFNs) have recently emerged as a powerful approach for Bayesian prediction tasks, approximating the posterior predictive distribution (PPD) through in-context learning. Despite their strong empirical performance and ability to go beyond point predictions, a theoretical understanding of the algorithmic capability of transformers to learn distributions in context is still lacking. Focusing on Gaussian process regression problems, we show by construction that transformers can implement a gradient descent algorithm targeting the posterior predictive mean and variance, followed by nonlinear mappings that yield binned probabilities of the PPD. We study the error bounds of the approximated PPD in terms of attention depth and bin resolution. Based on these results, we further demonstrate the key role of normalization and the choice of attention depth in enabling the extrapolation abilities of transformers beyond the pretraining sample size range. We conduct simulations that corroborate our findings, providing insight into the expressivity of PFNs targeting PPDs and how architectural choices may influence generalization capabilities.
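The target of the construction can be sketched in plain NumPy, without any transformer. The code below runs gradient descent on the quadratic objectives whose minimisers give the Gaussian process posterior predictive mean and variance, then maps the resulting Gaussian to binned probabilities. The RBF kernel, step size, noise level, data and bin edges are all assumptions chosen for illustration; the paper's attention-based implementation is not reproduced.

```python
import math
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel matrix between 1D input arrays a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def gd_solve(A, b, n_iter=5000):
    """Gradient descent on 0.5 v^T A v - b^T v, whose minimiser solves A v = b."""
    eta = 1.0 / np.linalg.eigvalsh(A)[-1]   # step size 1 / largest eigenvalue
    v = np.zeros_like(b)
    for _ in range(n_iter):
        v = v - eta * (A @ v - b)
    return v

rng = np.random.default_rng(0)
noise_var = 0.1
x_train = np.linspace(-3.0, 3.0, 20)
y_train = np.sin(x_train) + np.sqrt(noise_var) * rng.standard_normal(20)
x_test = np.array([0.5])

A = rbf(x_train, x_train) + noise_var * np.eye(20)
k_star = rbf(x_train, x_test)[:, 0]

# Posterior predictive mean and variance of a noisy observation at x_test.
mean_gd = k_star @ gd_solve(A, y_train)
var_gd = 1.0 + noise_var - k_star @ gd_solve(A, k_star)

# Nonlinear map to binned probabilities: Gaussian CDF differences over bin edges.
edges = np.concatenate([[-np.inf], np.linspace(-3.0, 3.0, 13), [np.inf]])
cdf = np.array([0.5 * (1.0 + math.erf((e - mean_gd) / math.sqrt(2.0 * var_gd)))
                for e in edges])
bin_probs = np.diff(cdf)

print("predictive mean:", mean_gd, "predictive variance:", var_gd)
print("most likely bin index:", int(np.argmax(bin_probs)))
```

The number of gradient steps plays a role loosely analogous to depth here, and the bin grid controls how finely the predictive density is resolved.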

Subjects: Machine Learning, Machine Learning

Publish: 2026-05-26 08:54:02 UTC


#25 CART Random Forests as Sequential Allocation over Random Opportunity Sets: A Stochastic-Control Theory of Ensemble Risk

Authors: Tianxing Mei, Yingying Fan, Mingming Leng, Jinchi Lv

CART random forests are among the most widely used modern predictive methods, with well-documented empirical success. Yet, at the mechanistic level, the algorithm is often treated as a black box because of its complexity. In this paper, we develop a stochastic-control perspective on feature-subsampled CART random forests, named CART random opportunity-set allocation (CART-ROSA). At each node, the random subset of features is interpreted as a random feasible action set, and the CART split rule as a masked-action allocation policy. This policy induces a controlled stochastic process over informative split-count states, whose terminal law determines both single-tree error and cross-tree interaction terms in the forest mean squared error (MSE). Such a representation opens the black box of CART forests by separating two design levers: the informative-opportunity rate induced by feature subsampling, and the contraction strength from the within-mask split policy. We establish that the CART policy is locally stabilizing: it contracts imbalances in informative split allocations and concentrates terminal tree geometry. At the system level, however, it can be globally suboptimal for the forest objective. Specializing to the linear model, we derive the MSE risk expansion explicitly. Our results show how an operations-research perspective makes tractable a theoretical gap that is difficult to access from the standard algorithmic description of CART forests.

Subjects: Machine Learning, Machine Learning

Publish: 2026-05-26 08:13:22 UTC