Statistics

2026-04-16 | Total: 46

#1 The Epidemiology of Artificial Intelligence

Authors: Harsh Parikh, Tyler McCormick, Emily Johnson, Leo Hickey, Megan Ranney, Bhramar Mukherjee

Artificial intelligence (AI) systems increasingly shape how people access health information, make medical decisions, and receive care -- yet epidemiology lacks frameworks for measuring AI exposure or studying its health effects at the population level. Here we argue that AI now functions as a determinant of health and propose a conceptual framework, borrowed from environmental epidemiology, for studying it. We distinguish ambient AI exposure -- algorithmic curation and AI-mediated institutional decisions that affect populations regardless of individual choice -- from personal AI exposure -- direct, volitional use of AI tools. We characterize AI's possible causal roles in epidemiological models, show that existing experimental approaches are inadequate for capturing chronic, population-level effects, and illustrate these ideas with nationally representative US survey data. We discuss implications for study design, health equity, and AI governance.

Subject: Other Statistics

Publish: 2026-04-15 16:59:10 UTC


#2 Finite-Step Bounds for Iterated Correlation Matrices

Author: Ishrak AlhajjHassan

We establish finite-step probabilistic upper bounds on the contraction ratios $\rho_k = \Delta_{k+1}/\Delta_k$ for iterated Pearson correlation dynamics. Let $(P_k)_{k\ge 0}$ be the sequence generated by the Pearson update. Define $\Delta_k := \|P_{k+1}-P_k\|_F$, $\rho_k := \Delta_{k+1}/\Delta_k$ for $\Delta_k > 0$, and $\delta_k := \Delta_k/n$. Although $\Delta_k \to 0$ along convergent trajectories, the ratios $\rho_k$ may exceed unity in finitely many steps. This behavior is invisible to local linearization. Our main contribution is a probabilistic bounding framework that captures these finite-step expansions. We initialize $P_0$ with i.i.d. $\mathcal{U}[-1,1]$ entries and let $\mathbb{P}$ be the induced measure. For $k \ge 2$, we construct state-dependent bounds $B_p : \mathbb{R}_+ \to \mathbb{R}_+$ satisfying $\mathbb{P}(\rho_k \le B_p(\delta_k)) \ge p$. The functions $B^{\mathrm{q}}_p(\delta)$ are empirical conditional $p$-quantiles of $\log \rho_k$ given $\delta_k$ under logarithmic binning. Larger families $B^{\mathrm{TC}}_{p,\tau}(\delta)$ and $B^{\mathrm{tol}}_{p,\tau}(\delta)$ are obtained via multiplicative adjustments, yielding pointwise larger bounds that preserve the $\delta$-dependence. Validation on held-out trajectories confirms the bounds hold with empirical coverage matching nominal levels for all $n \in [3,2000]$. The baseline $0.95$-quantile bound $B^{\mathrm{q}}_{0.95}(\delta)$ yields two concrete results: $\mathbb{P}(\rho \le 1 \mid \delta \le 0.03) \ge 0.95$ uniformly in $n$, and $\mathbb{P}(\rho \le 1.7) \ge 0.95$ for 21 of 22 dimensions. The exception $n = 69$ attains $2.35$, revealing a rare extreme upper tail discontinuity not captured by asymptotic analysis. These are the first finite-step probabilistic bounds for Pearson correlation dynamics. The framework is fully reproducible with provided code and data.
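
As an illustration of the quantities defined in this abstract, the following sketch iterates the update from a uniform random start and records the step sizes, the contraction ratios, and the step sizes normalized by n. It assumes the Pearson update maps a matrix to the correlation matrix of its columns (computed here with `numpy.corrcoef`); the function names are hypothetical and this is not the authors' released code.

```python
import numpy as np


def pearson_update(P):
    """Assumed Pearson update: correlation matrix of the columns of P."""
    return np.corrcoef(P, rowvar=False)


def contraction_ratios(n, steps, seed=0):
    """Return (step sizes, contraction ratios, normalized step sizes)."""
    rng = np.random.default_rng(seed)
    P = rng.uniform(-1.0, 1.0, size=(n, n))  # P_0 with i.i.d. U[-1, 1] entries
    step_sizes = []
    for _ in range(steps + 1):
        P_next = pearson_update(P)
        # Frobenius norm of the difference between consecutive iterates.
        step_sizes.append(np.linalg.norm(P_next - P, ord="fro"))
        P = P_next
    step_sizes = np.array(step_sizes)
    # Ratio of consecutive step sizes; only meaningful while step sizes are positive.
    ratios = step_sizes[1:] / step_sizes[:-1]
    return step_sizes, ratios, step_sizes / n
```

Keeping `steps` small avoids the degenerate regime near convergence, where step sizes approach zero and the ratio is no longer well defined.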

Subjects: Statistics Theory, Dynamical Systems

Publish: 2026-04-15 16:41:43 UTC


#3 Improving Treatment Effect Estimation in Trials through Adaptive Borrowing of External Controls

Authors: Qinwei Yang, Jingyi Li, Peng Wu, Shu Yang

Randomized controlled trials (RCTs) often suffer from limited inferential efficiency in estimating treatment effects due to their small sample sizes. In recent years, incorporating external controls (ECs) has gained increasing attention as an effective way to augment small RCTs and thereby enhance estimation efficiency. However, ECs are not always comparable to RCTs, and direct borrowing without careful evaluation can introduce substantial bias and, paradoxically, undermine the accuracy of treatment effect estimation. In this paper, we propose a novel adaptive influence-based sample borrowing framework to improve average treatment effect (ATE) estimation in RCTs. The framework quantifies the "comparability" of each sample in ECs using influence functions and identifies the optimal subset of ECs that minimizes the mean squared error of the ATE estimator. The proposed framework is assumption-lean regarding the distribution of ECs and is robust to outliers, making it broadly applicable across diverse settings. Moreover, we develop an outcome calibration method to improve the data utilization efficiency of ECs, further strengthening the adaptive influence-based sample-borrowing framework. We demonstrate the effectiveness of the proposed method using both simulated and real-world datasets.

Subject: Methodology

Publish: 2026-04-15 15:20:27 UTC


#4 High-Dimensional Data Analysis for Elliptically Symmetric Distributions

Author: Long Feng

High-dimensional data arise routinely in modern statistics, econometrics, finance, genomics, and machine learning. While a large body of existing methodology is developed under Gaussian or light-tailed assumptions, many real data sets exhibit heavy tails, heterogeneity, and departures from classical covariance-based models. This book provides a systematic treatment of high-dimensional data analysis under elliptically symmetric distributions, with an emphasis on robust inference based on spatial signs, spatial ranks, multivariate Kendall's tau matrices, and related shape-based methods. The book covers the basic theory of elliptical symmetry, high-dimensional location inference, estimation and testing for covariance and precision matrices, sphericity and proportionality testing, high-dimensional alpha testing in factor pricing models, change-point analysis, white-noise and independence testing, high-dimensional discriminant analysis, and dimension reduction through principal component analysis and factor models. Throughout, we review classical low-dimensional and high-dimensional benchmark methods and then develop robust alternatives tailored to elliptical models. Particular attention is paid to the interplay between sum-type, max-type, and adaptive procedures, as well as to the role of scatter, shape, and rank-based dependence measures in heavy-tailed settings. This book is intended as a unified overview of robust high-dimensional methods under elliptical symmetry and as a synthesis of the author's recent research contributions in this area. It is written for researchers and graduate students in statistics, econometrics, and related fields who are interested in modern high-dimensional inference beyond the Gaussian paradigm.

Subject: Methodology

Publish: 2026-04-15 14:56:58 UTC


#5 The Integer-valued Moving-Average Random Field

Authors: Angelika Silbernagel, Christian H. Weiß

An integer-valued moving average (INMA) model for count random fields is proposed and investigated. Closed-form expressions are derived for both its marginal distribution and spatial dependence structure, for arbitrary model order. In particular, general expressions for bivariate distributions and autocovariances are provided. It is shown that the INMA random field can be equipped (among others) with a Poisson marginal distribution. It is also illustrated that different and well-interpretable dependence structures are possible.
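
To make the construction concrete, here is a minimal simulation sketch of a first-order moving-average count field on a two-dimensional lattice. It assumes binomial thinning and Poisson innovations, with each site summing its own innovation and independently thinned copies of the innovations directly above and to the left. This neighbourhood, the parameter names, and the function name are illustrative assumptions, not the paper's specification.

```python
import numpy as np


def simulate_inma_field(rows, cols, lam, beta_up, beta_left, seed=0):
    """Simulate X[i, j] = eps[i, j] + beta_up o eps[i-1, j] + beta_left o eps[i, j-1].

    Here "o" denotes binomial thinning: beta o eps ~ Binomial(eps, beta).
    """
    rng = np.random.default_rng(seed)
    # One extra row and column of innovations so every site has both neighbours.
    eps = rng.poisson(lam, size=(rows + 1, cols + 1))
    up = rng.binomial(eps[:-1, 1:], beta_up)      # thinned innovation from site (i-1, j)
    left = rng.binomial(eps[1:, :-1], beta_left)  # thinned innovation from site (i, j-1)
    return eps[1:, 1:] + up + left
```

Under these assumptions each thinned term is again Poisson and the three terms are independent, so the marginal distribution is Poisson with mean `lam * (1 + beta_up + beta_left)`.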

Subject: Statistics Theory

Publish: 2026-04-15 12:09:54 UTC


#6 Testing Alpha in High-Dimensional Conditional Time-Varying Factor Models with Dependent Observations

Authors: Long Feng, Huifang Ma, Zhaojun Wang

This paper studies alpha testing in a high-dimensional conditional time-varying factor model with temporally dependent observations. Both factor loadings and alpha processes are allowed to vary smoothly over time, and the cross-sectional dimension may be comparable to or larger than the sample size. Using a B-spline sieve method, we develop a sum-type test for dense alternatives, a max-type test for sparse alternatives, and a Cauchy combination test for adaptive inference. On the theoretical side, we derive explicit stochastic expansions for the estimated average alphas, establish asymptotic normality of the sum statistic, and develop the extreme-value limit theory for the max statistic by showing its Gumbel convergence under temporal dependence together with the validity of block-bootstrap calibration. We further prove asymptotic independence between the sum and max statistics and thereby justify the Cauchy combination test. Simulation results demonstrate that the proposed procedures achieve satisfactory size control and competitive power across a wide range of dense and sparse alternatives. An empirical application further illustrates the usefulness of the proposed methods in testing asset-pricing models with time-varying structure.
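
The combination step can be sketched with the standard Cauchy combination formula, which turns several p-values (for example, one from a sum-type test and one from a max-type test) into a single p-value. The equal default weights and the function name are assumptions made for illustration; the paper's statistics and calibration are not reproduced here.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine p-values using the Cauchy combination statistic."""
    if weights is None:
        weights = [1.0 / len(p_values)] * len(p_values)
    # Each p-value is mapped to a Cauchy-distributed quantity under the null.
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values)
    )
    # The weighted sum is compared against the standard Cauchy distribution.
    return 0.5 - math.atan(statistic) / math.pi
```

A single small p-value dominates the statistic, which is why the combined test retains power against sparse alternatives while a moderate spread of small p-values still helps against dense ones.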

Subject: Methodology

Publish: 2026-04-15 12:02:52 UTC


#7 Forecasting Multivariate Time Series under Predictive Heterogeneity: A Validation-Driven Clustering Framework

Authors: Ziling Ma, Ángel López Oriona, Hernando Ombao, Ying Sun

We study adaptive pooling under predictive heterogeneity in high-dimensional multivariate time series forecasting, where global models improve statistical efficiency but may fail to capture heterogeneous predictive structure, while naive specialization can induce negative transfer. We formulate adaptive pooling as a statistical decision problem and propose a validation-driven framework that determines when and how specialization should be applied. Rather than grouping series based on representation similarity, we define partitions through out-of-sample predictive performance, thereby aligning data organization with predictive risk, defined as expected out-of-sample loss and approximated via validation error. Cluster assignments are iteratively updated using validation losses for both point (Huber) and probabilistic (pinball) forecasting, improving robustness to heavy-tailed errors and local anomalies. To ensure reliability, we introduce a leakage-free fallback mechanism that reverts to a global model whenever specialization fails to improve validation performance, providing a safeguard against performance degradation under a strict training-validation-test protocol. Experiments on large-scale traffic datasets demonstrate consistent improvements over strong baselines while avoiding degradation when heterogeneity is weak. Overall, the proposed framework provides a principled and practically reliable approach to adaptive pooling in high-dimensional forecasting problems.
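
The two validation losses named above, together with a simple version of the fallback decision, can be sketched as follows. The function names, the default Huber threshold, and the strict-improvement rule are assumptions for illustration rather than the authors' implementation.

```python
import numpy as np


def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for small residuals, linear for large ones."""
    residual = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_res = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = delta * (abs_res - 0.5 * delta)
    return float(np.mean(np.where(abs_res <= delta, quadratic, linear)))


def pinball_loss(y_true, y_pred, quantile):
    """Mean pinball (quantile) loss for a forecast of the given quantile."""
    residual = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(quantile * residual, (quantile - 1.0) * residual)))


def select_model(global_val_loss, specialized_val_loss):
    """Fallback rule: keep the global model unless specialization improves validation loss."""
    return "specialized" if specialized_val_loss < global_val_loss else "global"
```

Because the decision uses validation loss only, test data never influences which model is kept.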

Subjects: Methodology, Machine Learning

Publish: 2026-04-15 11:35:32 UTC


#8 Covariance-adapting algorithm for semi-bandits with application to sparse rewards

Authors: Pierre Perrault, Vianney Perchet, Michal Valko

We investigate stochastic combinatorial semi-bandits, where the entire joint distribution of outcomes impacts the complexity of the problem instance (unlike in standard bandits). Typical distributions considered depend on specific parameter values, whose prior knowledge is required in theory but quite difficult to estimate in practice; an example is the commonly assumed sub-Gaussian family. We alleviate this issue by instead considering a new general family of sub-exponential distributions, which contains bounded and Gaussian ones. We prove a new lower bound on the expected regret on this family that is parameterized by the unknown covariance matrix of outcomes, a tighter quantity than the sub-Gaussian matrix. We then construct an algorithm that uses covariance estimates, and provide a tight asymptotic analysis of the regret. Finally, we apply and extend our results to the family of sparse outcomes, which has applications in many recommender systems.

Subject: Machine Learning

Publish: 2026-04-15 11:27:39 UTC


#9 Adaptive Sample Size Simulations with R package adsasi

Author: Skerdi Haviari

Planning empirical experiments such as clinical trials or A/B tests requires sample size determination, which in many interesting cases has no closed-form solution (e.g. factorial or adaptive designs). adsasi is a new R package that enables simulations-first sample size calculations for any trial that can be simulated in short compute time. First, the user specifies the trial as a function that takes a sample size as argument, simulates the experiment, and returns a boolean for success/failure. Then, adsasi functions adsasi_0d and adsasi_1d iteratively call it on different sample sizes and progressively home in on the one with nominal success rate (power), assuming that increasing sample size increases power. adsasi_1d can also draw, purely empirically, the relationship between a design parameter and sample size. The implementation uses a modified probit regression (with success/failure as the dependent variable), informed by simulations conducted around the target size, and provides standard errors at each stage using the Cramér-Rao bound derived from a custom analytical Hessian matrix. Simple examples are first presented, yielding results within Monte Carlo variance of their closed-form expressions, then intractable ones (including bootstrapping from an existing medical cohort). adsasi will hopefully facilitate the funding and conduct of interesting, highly complex experimental designs by making their sizing straightforward.
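
The simulate-then-search idea can be sketched in a few lines. This is not the adsasi API or its probit-regression algorithm: it is a plain bisection over Monte Carlo power estimates, written in Python, which relies on the same assumption that power increases with sample size. The toy trial, the function names, and all default values are hypothetical.

```python
import math

import numpy as np


def simulate_trial(n, rng):
    """Hypothetical trial: two-arm z-test, n per arm, true effect 0.5, unit variance."""
    control = rng.normal(0.0, 1.0, size=n)
    treated = rng.normal(0.5, 1.0, size=n)
    z = (treated.mean() - control.mean()) / math.sqrt(2.0 / n)
    return abs(z) > 1.959964  # success if the two-sided 5% test rejects


def estimate_power(simulate, n, n_sims, rng):
    """Fraction of simulated trials of size n that succeed."""
    return sum(simulate(n, rng) for _ in range(n_sims)) / n_sims


def find_sample_size(simulate, target=0.8, lo=2, hi=512, n_sims=2000, seed=0):
    """Bisection for the smallest n whose simulated power reaches the target."""
    rng = np.random.default_rng(seed)
    if estimate_power(simulate, hi, n_sims, rng) < target:
        raise ValueError("upper bound does not reach the target power")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if estimate_power(simulate, mid, n_sims, rng) >= target:
            hi = mid
        else:
            lo = mid
    return hi
```

For this toy trial the closed-form answer is about 63 per arm, so `find_sample_size(simulate_trial)` should land near that value up to Monte Carlo noise. Bisection spends many simulations far from the target, which is the inefficiency a regression-based search is meant to reduce.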

Subject: Methodology

Publish: 2026-04-15 10:37:14 UTC


#10 Fractional lower-order covariance-based measures for cyclostationary time series with heavy-tailed distributions: application to dependence testing and model order identification

Authors: Wojciech Żuławiński, Agnieszka Wyłomańska

This article introduces new methods for the analysis of cyclostationary time series with infinite variance. Traditional cyclostationary analysis, based on periodically correlated (PC) processes, relies on the autocovariance function (ACVF). However, the ACVF is not suitable for data exhibiting a heavy-tailed distribution, particularly with infinite variance. Thus, we propose a novel framework for the analysis of cyclostationary time series with heavy-tailed distribution, utilizing the fractional lower-order covariance (FLOC) as an alternative to covariance. This leads to the introduction of two new autodependence measures: the periodic fractional lower-order autocorrelation function (peFLOACF) and the periodic fractional lower-order partial autocorrelation function (peFLOPACF). These measures generalize the classical periodic autocorrelation function (peACF) and periodic partial autocorrelation function (pePACF), offering robust tools for analyzing infinite-variance processes. Two practical applications of the proposed measures are explored: a portmanteau test for testing dependence in cyclostationary series and a method for order identification in periodic autoregressive (PAR) and periodic moving average (PMA) models with infinite variance. Both applications demonstrate the potential of new tools, with simulations validating their efficiency. The methodology is further illustrated through the analysis of real-world air pollution data, which showcases its practical utility. The results indicate that the proposed measures based on FLOC provide reliable and efficient techniques for analyzing cyclostationary processes with heavy-tailed distributions.
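
A sample version of the fractional lower-order covariance can be sketched as below, assuming the usual definition as the expectation of a product of signed powers, E[X^<A> Y^<B>] with x^<a> = |x|^a sign(x). The function names are hypothetical, and the periodic measures peFLOACF and peFLOPACF introduced in the paper are not reproduced here.

```python
import numpy as np


def signed_power(x, a):
    """Signed power x^<a> = |x|^a * sign(x), applied elementwise."""
    x = np.asarray(x, dtype=float)
    return np.abs(x) ** a * np.sign(x)


def sample_floc(x, y, a, b):
    """Sample fractional lower-order covariance of x and y with exponents a and b."""
    return float(np.mean(signed_power(x, a) * signed_power(y, b)))


def sample_auto_floc(series, lag, a, b):
    """Sample FLOC between a series and its copy shifted by a non-negative lag."""
    series = np.asarray(series, dtype=float)
    if lag == 0:
        return sample_floc(series, series, a, b)
    return sample_floc(series[lag:], series[:-lag], a, b)
```

With `a = b = 1` the measure reduces to an uncentered second moment; choosing smaller exponents keeps the estimate finite when the variance is infinite.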

Subject: Methodology

Publish: 2026-04-15 10:10:43 UTC


#11 Relative plausibility versus probabilism: A level-of-analysis error in juridical proof

Author: Stanley E. Lazic

Debates about juridical proof are often framed as a conflict between probabilistic approaches and relative plausibility theory (RPT). This paper argues that this opposition rests on a level-of-analysis error. Drawing on Marr's distinction between levels of analysis, we show that RPT and probabilistic approaches operate at different conceptual levels and are therefore compatible rather than competing theories. RPT provides a computational-level description of juridical proof, characterizing the task of comparing explanations in light of the evidence and assessing whether a standard of proof has been met. Probabilistic approaches supply algorithmic-level accounts that specify how such comparative assessments can be represented and computed. When plausibility judgments satisfy minimal coherence conditions, relative plausibility corresponds to posterior odds. Recognizing this distinction clarifies longstanding disputes and highlights the complementary roles of explanation and probability in legal reasoning.
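
For reference, posterior odds in their standard Bayesian form compare two competing explanations given the evidence. The symbols $H_1$, $H_2$, and $E$ are introduced here for illustration and do not come from the abstract.

```latex
\frac{P(H_1 \mid E)}{P(H_2 \mid E)}
  = \frac{P(E \mid H_1)}{P(E \mid H_2)} \times \frac{P(H_1)}{P(H_2)}
```

The first factor on the right is the likelihood ratio and the second is the prior odds, so a comparative judgment between explanations can be expressed without computing the probability of the evidence itself.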

Subject: Applications

Publish: 2026-04-15 06:41:30 UTC


#12 Robust Low-Rank Tensor Completion based on M-product with Weighted Correlated Total Variation and Sparse Regularization

Authors: Biswarup Karmakar, Ratikanta Behera

The robust low-rank tensor completion problem addresses the challenge of recovering corrupted high-dimensional tensor data with missing entries, outliers, and sparse noise commonly found in real-world applications. Existing methodologies have encountered fundamental limitations due to their reliance on uniform regularization schemes, particularly the tensor nuclear norm and $\ell_1$ norm regularization approaches, which indiscriminately apply equal shrinkage to all singular values and sparse components, thereby compromising the preservation of critical tensor structures. The proposed tensor weighted correlated total variation (TWCTV) regularizer addresses these shortcomings through an $M$-product framework that combines a weighted Schatten-$p$ norm on gradient tensors for low-rankness with smoothness enforcement and weighted sparse components for noise suppression. The proposed weighting scheme adaptively reduces the thresholding level to preserve both dominant singular values and sparse components, thus improving the reconstruction of critical structural elements and nuanced details in the recovered signal. Through a systematic algorithmic approach, we introduce an enhanced alternating direction method of multipliers (ADMM) that offers both computational efficiency and theoretical substantiation, with convergence properties comprehensively analyzed within the $M$-product framework. Comprehensive numerical evaluations across image completion, denoising, and background subtraction tasks validate the superior performance of this approach relative to established benchmark methods.

Subjects: Machine Learning, Optimization and Control

Publish: 2026-04-15 06:20:28 UTC


#13 Joint Representation Learning and Clustering via Gradient-Based Manifold Optimization

Authors: Sida Liu, Yangzi Guo, Mingyuan Wang

Clustering and dimensionality reduction have been crucial topics in machine learning and computer vision. Clustering high-dimensional data has been challenging for a long time due to the curse of dimensionality. For that reason, a more promising direction is the joint learning of dimension reduction and clustering. In this work, we propose a Manifold Learning Framework that learns dimensionality reduction and clustering simultaneously. The proposed framework is able to jointly learn the parameters of a dimension reduction technique (e.g. linear projection or a neural network) and cluster the data based on the resulting features (e.g. under a Gaussian Mixture Model framework). The framework searches for the dimension reduction parameters and the optimal clusters by traversing a manifold, using Gradient Manifold Optimization. The proposed framework is exemplified with a Gaussian Mixture Model as one simple but efficient example, in a process that is somewhat similar to unsupervised Linear Discriminant Analysis (LDA). We apply the proposed method to the unsupervised training of simulated data as well as a benchmark image dataset (i.e. MNIST). The experimental results indicate that our algorithm has better performance than popular clustering algorithms from the literature.

Subject: Machine Learning

Publish: 2026-04-15 05:18:27 UTC


#14 Estimating Continuous Treatment Effects with Two-Stage Kernel Ridge Regression

Authors: Seok-Jin Kim, Kaizheng Wang

We study the problem of estimating the effect function for a continuous treatment, which maps each treatment value to a population-averaged outcome. A central challenge in this setting is confounding: treatment assignment often depends on covariates, creating selection bias that makes direct regression of the response on treatment unreliable. To address this issue, we propose a two-stage kernel ridge regression method. In the first stage, we learn a model for the response as a function of both treatment and covariates; in the second stage, we use this model to construct pseudo-outcomes that correct for distribution shift, and then fit a second model to estimate the treatment effect. Although the response varies with both treatment and covariates, the induced effect function obtained by averaging over covariates is typically much simpler, and our estimator adapts to this structure. Furthermore, we introduce a fully data-driven model selection procedure that achieves provable adaptivity to both the unknown degree of overlap and the regularity (eigenvalue decay) of the underlying kernel.

Subjects: Methodology, Machine Learning

Publish: 2026-04-15 02:21:15 UTC


#15 Leveraging machine learning to estimate individualized treatment effects in cluster-randomized trials

Authors: Changjun Li, Xi Fang, Michael O. Harhay, Andrew B. Forbes, F. Perry Wilson, Guangyu Tong, Fan Li

Cluster-randomized trials (CRTs) are widely used to evaluate interventions delivered at the clinic, practice, or community level. Although standard analyses typically target average treatment effects, such summaries mask potentially meaningful variation in treatment response across individuals and clusters. This work addresses the estimation of conditional average treatment effects (CATEs) for continuous outcomes in two-arm parallel CRTs by defining causal estimands that incorporate both individual- and cluster-level baseline covariates while marginalizing over unobserved cluster heterogeneity. To estimate these quantities, we develop a unified framework based on mixed-effects machine learning, integrating and extending a range of existing approaches, including Bayesian additive regression trees with random effects, multilevel Bayesian causal forests, mixed-effects random forests, several mixed-effects gradient boosting procedures, and generalized additive mixed models, while incorporating cluster-specific random intercepts to account for within-cluster dependence. We evaluate these methods across diverse simulation scenarios and demonstrate their use in the Task Shifting and Blood Pressure Control in Ghana CRT, which investigates strategies for improving hypertension management. Drawing on these investigations, we provide practical guidance for applying mixed-effects machine learning to quantify treatment-effect heterogeneity in CRTs, together with reproducible code that enables investigators to implement all methods within a coherent workflow.

Subject: Methodology

Publish: 2026-04-15 02:15:28 UTC


#16 A Machine Learning Framework for Uncertainty-Calibrated Capability Decision under Finite Samples

Authors: Fei Jiang, Lei Yang

Process capability indices such as $C_{pk}$ are widely used for manufacturing decisions, yet are typically applied via deterministic thresholding of finite-sample estimates, ignoring uncertainty and leading to unstable outcomes near the capability boundary. This paper reformulates capability approval as a decision-risk calibration problem, quantifying the probability of misclassification under finite-sample variability. We propose an uncertainty-aware hybrid framework that combines a statistically grounded baseline with a data-driven residual learner, where the baseline provides an interpretable approximation of failure risk and the residual captures systematic deviations due to non-normality, measurement effects, and finite-sample uncertainty. A nested Monte Carlo procedure is introduced to approximate oracle decision risk under controlled synthetic settings, enabling direct evaluation of probabilistic calibration. Empirical results show that conventional approaches exhibit substantial miscalibration in near-threshold regimes, while the proposed framework provides a structured and uncertainty-aware representation of decision risk that remains stable under stricter leak-free evaluation. The framework is simple, compatible with existing capability metrics, and readily deployable in industrial analytics systems.
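
The instability near the capability boundary can be illustrated with a small Monte Carlo sketch. It uses the standard definition of $C_{pk}$ and a normal process; the specification limits, sample size, and threshold in the usage note are made-up values, and this is not the paper's hybrid framework or its nested Monte Carlo procedure.

```python
import numpy as np


def cpk(sample, lsl, usl):
    """Estimated C_pk: distance from the mean to the nearer limit over three standard deviations."""
    mean = np.mean(sample)
    sd = np.std(sample, ddof=1)
    return min(usl - mean, mean - lsl) / (3.0 * sd)


def approval_rate(mu, sigma, lsl, usl, n, threshold, reps=5000, seed=0):
    """Fraction of size-n samples whose estimated C_pk meets the threshold."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(mu, sigma, size=(reps, n))
    estimates = np.array([cpk(row, lsl, usl) for row in samples])
    return float(np.mean(estimates >= threshold))
```

For example, `approval_rate(0.0, 1.0, -3.9, 3.9, n=30, threshold=1.33)` describes a process whose true $C_{pk}$ is 1.30, just below the threshold, yet a sizeable share of 30-unit samples would still be approved by deterministic thresholding.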

Subject: Applications

Publish: 2026-04-14 23:22:55 UTC


#17 Newton's Algorithm as a Gradient Flow: A Geometric Framework for Recursive Mixture Estimation

Author: Bernardo Flores

Bayesian nonparametric mixture models provide a flexible framework for data analysis but are often hindered by the computational expense of traditional inference methods like MCMC. A fast, recursive algorithm proposed by Newton (2002) offers a practical alternative, yet its formal connection to Bayesian inference and its theoretical properties remain only partially understood. This paper reveals a new geometric interpretation of this class of predictive recursions. We demonstrate that Newton's recursion is a discrete-time approximation of a gradient flow on the space of probability measures governed by the Fisher-Rao geometry, providing the first rigorous dynamical characterisation of this family of estimators. This geometric perspective provides a principled theoretical foundation for studying these recursions: it clarifies their convergence behaviour, situates them within the variational Bayes literature, and yields a systematic basis for generalisation by modifying the underlying geometry and discretisation. In contrast to approaches that construct gradient flows from a prescribed variational objective, this work proceeds in the reverse direction: beginning from an existing recursive estimator and uncovering the variational problem it implicitly solves, it opens a pathway for the systematic analysis and extension of a broad class of sequential Bayesian estimators.
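
For readers unfamiliar with the recursion, a grid-based sketch of the predictive recursion update is given below. It assumes a Gaussian kernel with known standard deviation, a uniform initial guess, and weights 1/(i+1); these choices and the function name are illustrative assumptions, and the geometric analysis in the paper is not reproduced.

```python
import numpy as np


def predictive_recursion(data, grid, sd=1.0):
    """One pass of the predictive recursion for a mixing density on a uniform grid."""
    dtheta = grid[1] - grid[0]
    # Uniform initial guess, normalized so that sum(density) * dtheta == 1.
    density = np.full(grid.shape, 1.0 / (grid.size * dtheta))
    for i, x in enumerate(data, start=1):
        weight = 1.0 / (i + 1)
        kernel = np.exp(-0.5 * ((x - grid) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
        # Predictive density of x under the current mixing estimate.
        marginal = np.sum(kernel * density) * dtheta
        # Convex combination of the current estimate and its one-observation posterior.
        density = (1.0 - weight) * density + weight * kernel * density / marginal
    return density
```

Each observation is processed once, so the cost is linear in the sample size, and every update keeps the estimate a normalized density on the grid.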

Subject: Methodology

Publish: 2026-04-14 23:07:03 UTC


#18 Addressing Confounding by Indication Through (Un)Measured Centre Characteristics in Learn-As-you-GO(LAGO) Trials

Authors: Minh Thu Bui, Christopher T. Longenecker, Ante Bing, Donna Spiegelman, Allison R. Webel, Hayden B. Bosworth, Judith J. Lok

Adaptive clinical trial designs have gained popularity, allowing for modifications to sample sizes, participant populations, treatment arm selection, and other parameters. However, existing adaptive trial designs do not address changes to the intervention packages themselves, which have a reputation for invalidating statistical inferences. This has motivated the development of Learn-As-you-GO (LAGO), an adaptive clinical trial design that allows for modifications to multicomponent intervention packages over different stages. Centre characteristics might be confounders, predicting both the intervention package implemented and the outcomes in the centres. This work extends LAGO theory by using fixed centre effects to control for confounding by indication through both measured and unmeasured centre-specific characteristics. We show that the fixed centre effects provide reliable control for centre-level confounding even with small numbers of centres, demonstrating the applicability of this LAGO design across various trial settings. We also extend LAGO to allow centres to participate in more than one stage, which is realistic in large-scale implementation trials. Point and interval estimators for the intervention effects are derived. Consistency and asymptotic normality of the intervention effect estimators are established. Moreover, we provide valid hypothesis tests for the overall intervention effect. The optimal intervention package achieving a predetermined outcome mean while minimizing cost is estimated through constrained optimization.

Subjects: Methodology, Statistics Theory

Publish: 2026-04-14 20:13:26 UTC


#19 Sequential Change Detection for Multiple Data Streams with Differential Privacy

Authors: Lixing Zhang, Liyan Xie, Ruizhi Zhang

Sequential change-point detection seeks to rapidly identify distributional changes in streaming data while controlling false alarms. Existing multi-stream detection methods typically rely on non-private access to raw observations or intermediate statistics, limiting their usage in privacy-sensitive settings. We study sequential change-point detection for multiple data streams under differential privacy constraints. We consider multiple independent streams undergoing a synchronized change at an unknown time and in an unknown subset of streams, and propose DP-SUM-CUSUM, a differentially private detection procedure based on the summation of per-stream CUSUM statistics with calibrated Laplace noise injection. We show that DP-SUM-CUSUM satisfies sequential $\varepsilon$-differential privacy and derive bounds on the average run length to false alarm and the worst-case average detection delay, explicitly characterizing the privacy--efficiency tradeoff. A truncation-based extension is also presented to handle distributional shifts with unbounded log-likelihood ratios. Simulations and experiments on an Internet of Things (IoT) botnet dataset validate the proposed approach.
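
The general shape of a summed-CUSUM detector with Laplace noise can be sketched as follows. This is a simplified illustration, not the DP-SUM-CUSUM procedure itself: it assumes a Gaussian mean shift with unit variance, clips the log-likelihood ratio as a stand-in for truncation, and treats the noise scale as a free parameter rather than calibrating it to a privacy budget. All names and defaults are hypothetical.

```python
import numpy as np


def noisy_sum_cusum(streams, shift, threshold, noise_scale, clip=3.0, seed=0):
    """Return the first time index where the noisy summed CUSUM exceeds the threshold, else None."""
    rng = np.random.default_rng(seed)
    n_streams, horizon = streams.shape
    cusum = np.zeros(n_streams)
    for t in range(horizon):
        # Gaussian log-likelihood ratio for a mean shift from 0 to `shift`.
        llr = shift * streams[:, t] - 0.5 * shift ** 2
        llr = np.clip(llr, -clip, clip)  # assumed truncation to bound each increment
        cusum = np.maximum(0.0, cusum + llr)  # per-stream CUSUM recursion
        # Fresh Laplace noise is added to the summed statistic at every step (assumption).
        noisy_sum = cusum.sum() + rng.laplace(0.0, noise_scale)
        if noisy_sum > threshold:
            return t
    return None
```

A larger noise scale requires a higher threshold to keep false alarms rare, which in turn lengthens the detection delay; this is the privacy-efficiency tradeoff the abstract refers to.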

Subjects: Statistics Theory, Cryptography and Security

Publish: 2026-04-14 20:08:07 UTC


#20 Efficient estimation of cumulative incidence curves via data fusion with surrogates: application to integrated analysis of vaccine trial and immunobridging data

Authors: Pan Zhao, Peter B. Gilbert, Oliver Dukes, Bo Zhang

Refined vaccine regimens containing variant-matched inserts are often authorized based on historical phase 3 efficacy trials together with immunobridging studies. Phase 3 trials are essential for establishing immune biomarkers that reliably predict disease risk or vaccine efficacy against clinical endpoints. Once such immune correlates are identified, updated vaccine regimens can be approved through immunobridging designs that compare the immunogenicity of the updated regimen to that of an already-approved vaccine. We develop methods of inference for the counterfactual cumulative incidence curve using participant-level data from both a historical vaccine efficacy trial and an immunobridging study. We further extend these methods to pathogens with multiple serotypes -- such as dengue virus and influenza -- by estimating cause-specific cumulative incidence curves. We describe the identification assumptions, propose efficient and multiply robust estimators, and assess their finite-sample performance through simulation studies. We then apply the proposed methods to (1) estimating the hypothetical cumulative incidence curve for a bivalent mRNA booster and (2) testing a key assumption of no controlled direct effects, using data from the COVID-19 Variant Immunologic Landscape (COVAIL) Trial, a multistage randomized clinical study evaluating the safety and immunogenicity of a second COVID-19 booster dose.

Subjects: Methodology, Applications

Publish: 2026-04-14 19:53:16 UTC


#21 Estimating effect thresholds and beyond: A flexible framework for multivariate alert detection

Authors: Lucia Ameis, Niklas Hagemann, Kathrin Möllenhoff

Evaluating the influence of continuous covariates, like exposure time or dose, on a response variable is a pivotal objective in the assessment of a compound's effect, particularly when determining toxicity in pre-clinical research or pharmacokinetics in clinical trials. The determination of an alert, such as the ED50 value, at which a pre-specified threshold of the response variable is crossed, is an important tool for the evaluation process. In practice, response data might be available for combinations of different covariates and the alert depending on both is of interest. In this case, it is crucial to use all available information and extrapolate between cases to ensure the optimal utilization of the data. In this paper, we introduce a parametric approach that allows alerts to be estimated in a multidimensional setting. For time-dose-response data, for instance, alert doses at a given time can be determined, even when there are no measurements available at that exact time. Likewise, it allows estimation of alert times for a given dose. More generally, the method makes it possible to characterize the complete alert relationship between covariates by leveraging all available data. This is achieved by fitting a parametric model and constructing either a confidence band for the two-dimensional curve given for example a fixed time or dose or by constructing a confidence plane for the three-dimensional model fit. The initial model fit is achieved by the flexible framework of Generalized Additive Models for Location, Scale and Shape (GAMLSS), which offers the possibility to account for a plethora of complex three-dimensional data structures. We demonstrate the validity of our approach through a simulation study and present an application to data from a study investigating the relevance of the exposure duration on cytotoxicity in primary human hepatocytes.

Subjects: Methodology, Applications

Publish: 2026-04-14 19:52:57 UTC


#22 Identifiability of Potentially Degenerate Gaussian Mixture Models With Piecewise Affine Mixing

Authors: Danru Xu, Sébastien Lachapelle, Sara Magliacane

Causal representation learning (CRL) aims to identify the underlying latent variables from high-dimensional observations, even when the variables are mutually dependent. We study this problem for latent variables that follow a potentially degenerate Gaussian mixture distribution and that are observed only through a piecewise affine mixing function. We provide a series of progressively stronger identifiability results for this challenging setting, in which the probability density functions are ill-defined because of the potential degeneracy. For identifiability up to permutation and scaling, we leverage a sparsity regularization on the learned representation. Based on our theoretical results, we propose a two-stage method to estimate the latent variables by enforcing sparsity and Gaussianity in the learned representations. Experiments on synthetic and image data highlight our method's effectiveness in recovering the ground-truth latent variables.

Subjects: Machine Learning , Artificial Intelligence , Machine Learning , Statistics Theory

Publish: 2026-04-14 18:39:08 UTC


#23 Rare Event Analysis via Stochastic Optimal Control

Authors: Yuanqi Du, Jiajun He, Dinghuai Zhang, Eric Vanden-Eijnden, Carles Domingo-Enrich

Rare events such as conformational changes in biomolecules, phase transitions, and chemical reactions are central to the behavior of many physical systems, yet they are extremely difficult to study computationally because unbiased simulations seldom produce them. Transition Path Theory (TPT) provides a rigorous statistical framework for analyzing such events: it characterizes the ensemble of reactive trajectories between two designated metastable states (reactant and product), and its central object--the committor function, which gives the probability that the system will next reach the product rather than the reactant--encodes all essential kinetic and thermodynamic information. We introduce a framework that casts committor estimation as a stochastic optimal control (SOC) problem. In this formulation the committor defines a feedback control--proportional to the gradient of its logarithm--that actively steers trajectories toward the reactive region, thereby enabling efficient sampling of reactive paths. To solve the resulting hitting-time control problem we develop two complementary objectives: a direct backpropagation loss and a principled off-policy Value Matching loss, for which we establish first-order optimality guarantees. We further address metastability, which can trap controlled trajectories in intermediate basins, by introducing an alternative sampling process that preserves the reactive current while lowering effective energy barriers. On benchmark systems, the framework yields markedly more accurate committor estimates, reaction rates, and equilibrium constants than existing methods.
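As a toy illustration of the committor and of how it can steer trajectories, the sketch below works on a one-dimensional nearest-neighbour random walk rather than on the continuous dynamics the abstract has in mind. The chain, the parameter `p_up`, and both function names are assumptions made for illustration. The tilted step probability is the discrete Doob h-transform, used here as an analogue of a control proportional to the gradient of the log-committor. It is not the paper's backpropagation or Value Matching objective.

```python
def committor_chain(n, p_up, sweeps=2000):
    """Committor q[i] on states 0..n for a walk that steps up with prob p_up.

    State 0 plays the reactant (q = 0) and state n the product (q = 1).
    Interior values solve q[i] = p_up * q[i+1] + (1 - p_up) * q[i-1],
    obtained here by Gauss-Seidel sweeps.
    """
    q = [0.0] * (n + 1)
    q[n] = 1.0
    for _ in range(sweeps):
        for i in range(1, n):
            q[i] = p_up * q[i + 1] + (1.0 - p_up) * q[i - 1]
    return q

def tilted_up_probability(q, i, p_up):
    """Up-step probability of the walk conditioned to reach the product first.

    This is the Doob h-transform with h = q: the original step probability is
    reweighted by the committor ratio q[i+1] / q[i].
    """
    return p_up * q[i + 1] / q[i]

n, p_up = 10, 0.4  # a walk biased toward the reactant, so reactive paths are rare
q = committor_chain(n, p_up)
print([round(v, 4) for v in q])
print([round(tilted_up_probability(q, i, p_up), 4) for i in range(1, n)])
```

Under the tilted probabilities the walk never returns to state 0, so every simulated trajectory is reactive, which is the sampling benefit the abstract attributes to committor-based control.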

Subjects: Machine Learning, Machine Learning, Optimization and Control, Chemical Physics

Publish: 2026-04-14 18:34:38 UTC


#24 Adaptive Learning via Off-Model Training and Importance Sampling for Fully Non-Markovian Optimal Stochastic Control. Complete version

Authors: Dorival Leão, Alberto Ohashi, Simone Scotti, Adolfo M. D da Silva

This paper studies continuous-time stochastic control problems whose controlled states are fully non-Markovian and depend on unknown model parameters. Such problems arise naturally in path-dependent stochastic differential equations, rough-volatility hedging, and systems driven by fractional Brownian motion. Building on the discrete skeleton approach developed in earlier work, we propose a Monte Carlo learning methodology for the associated embedded backward dynamic programming equation. Our main contribution is twofold. First, we construct explicit dominating training laws and Radon--Nikodym weights for several representative classes of non-Markovian controlled systems. This yields an off-model training architecture in which a fixed synthetic dataset is generated under a reference law, while the dynamic programming operators associated with a target model are recovered by importance sampling. Second, we use this structure to design an adaptive update mechanism under parametric model uncertainty, so that repeated recalibration can be performed by reweighting the same training sample rather than regenerating new trajectories. For fixed parameters, we establish non-asymptotic error bounds for the approximation of the embedded dynamic programming equation via deep neural networks. For adaptive learning, we derive quantitative estimates that separate Monte Carlo approximation error from model-risk error. Numerical experiments illustrate both the off-model training mechanism and the adaptive importance-sampling update in structured linear-quadratic examples.
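The reweighting idea can be sketched in a far simpler setting than the paper's non-Markovian control problems. In the toy example below, a fixed sample is drawn once under a wide Gaussian reference law, and expectations under several target laws N(theta, 1) are recovered by likelihood-ratio (Radon-Nikodym) weights without drawing new samples. The Gaussian laws, `REF_SCALE`, and the function names are all assumptions chosen for illustration.

```python
import math
import random

REF_SCALE = 2.0  # reference law N(0, REF_SCALE^2), wider than every target N(theta, 1)

def draw_reference_sample(n, seed=0):
    """Generate the fixed synthetic dataset once, under the reference law."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, REF_SCALE) for _ in range(n)]

def weight(x, theta):
    """Density ratio of the target N(theta, 1) to the reference N(0, REF_SCALE^2)."""
    log_w = -0.5 * (x - theta) ** 2 + 0.5 * (x / REF_SCALE) ** 2 + math.log(REF_SCALE)
    return math.exp(log_w)

def reweighted_mean(sample, theta, f):
    """Importance-sampling estimate of E[f(X)] under the target with parameter theta."""
    return sum(weight(x, theta) * f(x) for x in sample) / len(sample)

sample = draw_reference_sample(200_000)

# Recalibrating to a new parameter value reuses the same sample; only the weights change.
for theta in (0.5, -0.3):
    estimate = reweighted_mean(sample, theta, lambda x: x * x)
    print(theta, estimate, theta**2 + 1.0)  # exact value of E[X^2] is theta^2 + 1
```

Because the reference law has heavier tails than each target, the weights stay bounded, which keeps the estimator's variance under control in this toy case.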

Subjects: Machine Learning, Machine Learning, Probability

Publish: 2026-04-14 16:32:46 UTC


#25 $p$-adic Linear Regression for Random Sampling with Digitwise Noise

Author: Tomoki Mihara

We propose a new probabilistic algorithm for $p$-adic linear regression under random sampling with digitwise noise. This includes a new probabilistic algorithm for modulo $p$ linear regression.

Subjects: Computation, Number Theory, Statistics Theory

Publish: 2026-04-14 11:23:40 UTC