Statistics

2026-05-07 | Total: 63

#1 Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval

Authors: Nicholas Barnfield, Juno Kim, Eshaan Nichani, Jason D. Lee, Yue M. Lu

How many key-value associations can a $d\times d$ linear memory store? We show that the answer depends not only on the $d^2$ degrees of freedom in the memory matrix, but also on the retrieval criterion. In an isotropic Gaussian model for the stored pairs, we show that top-1 retrieval, where every signal must beat its largest distractor, requires the logarithmic model-size scale $d^2\asymp n\log n$. We prove that the correlation matrix memory construction, which stores associations by superposing key-target outer products, achieves this scale through a sharp phase transition, and that the same scaling is necessary for any linear memory. Thus the logarithm is the intrinsic extreme-value price of winner-take-all decoding. We next consider listwise retrieval, where the correct target need not be the unique top-scoring item but should remain among the strongest candidates. To formalize this regime, we propose the Tail-Average Margin (TAM), a convex upper-tail criterion that certifies inclusion of the correct target in a controlled candidate list. Under this listwise retrieval criterion, the capacity follows the quadratic scale $d^2\asymp n$. At load $n/d^2\toα$, we develop an exact asymptotic theory for the TAM empirical-risk minimizer through a two-parameter scalar variational principle. The theory has a rich phenomenology: in the ridgeless limit it yields a closed-form critical load separating satisfiable and unsatisfiable phases, and it predicts the limiting laws of true scores, competitor scores, margins, and percentile profiles. Finally, a small-tail extrapolation further leads to the conjectural sharp top-1 threshold $d^2\sim 2n\log n$.
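
The correlation matrix memory described above can be illustrated with a minimal sketch. It assumes standard Gaussian keys and targets and scores each stored target $v_j$ against the readout $Wk_i$ by an inner product; the function name `cmm_top1_accuracy` and the scoring rule are illustrative assumptions, not the authors' code.

```python
import random


def cmm_top1_accuracy(n, d, seed=0):
    """Store n Gaussian key-target pairs in a d x d correlation matrix memory
    and return the fraction retrieved correctly under top-1 decoding.
    (Hypothetical helper; the scoring rule is an assumption.)"""
    rng = random.Random(seed)
    keys = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    targets = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

    # W = sum_i v_i k_i^T: superpose the key-target outer products.
    W = [[0.0] * d for _ in range(d)]
    for k, v in zip(keys, targets):
        for a in range(d):
            va = v[a]
            row = W[a]
            for b in range(d):
                row[b] += va * k[b]

    correct = 0
    for i, k in enumerate(keys):
        # Linear readout W k_i, then score every stored target against it.
        out = [sum(W[a][b] * k[b] for b in range(d)) for a in range(d)]
        scores = [sum(v[a] * out[a] for a in range(d)) for v in targets]
        # Top-1 retrieval: the true target must beat every distractor.
        if max(range(n), key=scores.__getitem__) == i:
            correct += 1
    return correct / n


print(cmm_top1_accuracy(n=6, d=48))
```

Raising `n` at fixed `d` gives a rough empirical view of how top-1 accuracy degrades with load; the sketch does not attempt to locate the threshold itself.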

Subjects: Machine Learning, Information Theory

Publish: 2026-05-06 17:53:20 UTC


#2 Concordance, symmetrization and non-exchangeability for bivariate copulas

Authors: Álvaro Rodríguez-García, Manuel Úbeda-Flores

We study the relationship between measures of non-exchangeability $μ_p$ ($p\in[1,+\infty]$), in the sense of Durante et al. (2010), and classical dependence functionals for bivariate copulas. We show that the symmetrization $C\mapsto(C+C^t)/2$ preserves Spearman's $ρ$ while annihilating $μ_p$, and that Blomqvist's $β$ carries no information about the degree of non-exchangeability. We also establish the sharp lower bound $σ(C)\ge 6\,μ_1(C)$, where $σ$ is the Schweizer-Wolff dependence measure, showing that asymmetry implies dependence. Closed-form expressions for $τ$, $ρ$, and the tail-dependence coefficients of the maximally non-exchangeable family $\{M_θ\}$ are derived as illustrations.
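
The symmetrization claim can be checked in a few lines. The sketch below uses the standard integral formula for Spearman's $ρ$ and assumes that $μ_p$ is a normalized $L^p$ distance between $C$ and its transpose $C^t(u,v)=C(v,u)$; the constant $c_p$ is a placeholder for that normalization, not a value taken from the paper.

```latex
\rho(C) = 12\int_0^1\!\!\int_0^1 C(u,v)\,du\,dv - 3,
\qquad
\int_0^1\!\!\int_0^1 C^t(u,v)\,du\,dv = \int_0^1\!\!\int_0^1 C(u,v)\,du\,dv .

% rho is affine in C and invariant under transposition, so for S = (C + C^t)/2:
\rho(S) = 12\cdot\tfrac12\left(\iint C + \iint C^t\right) - 3 = \rho(C).

% S is symmetric, S^t = S, hence (assuming \mu_p(C) = c_p \,\| C - C^t \|_p):
\mu_p(S) = c_p\,\| S - S^t \|_p = 0 .
```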

Subject: Statistics Theory

Publish: 2026-05-06 17:40:26 UTC


#3 Randompack: Cross-Platform Reproducible Random Number Generation and Distribution Sampling

Author: Kristján Jónasson

A C library for random number generation, Randompack, is presented. The library implements several modern random number generators (engines), including xoshiro256, PCG64, Philox, ranlux++, and sfc64; 14 continuous distributions including uniform, normal, exponential, gamma, beta, and multivariate normal; raw bit streams, bounded integers, permutations, and sampling without replacement. The engine and the distribution layers are separated so any engine can be used with any distribution. Benchmarks show that Randompack is faster overall than competing libraries, with speedup factors ranging from about 1 to 15 depending on engine, distribution, interface, and platform. A distinguishing feature is reproducibility: with the same seeds Randompack gives compatible results across programming languages, computers, CPU architectures, and compilers. The library includes comprehensive support for parallel simulation. It is accompanied by a comprehensive test suite, benchmarking programs, and example programs. Interfaces to Fortran, Python, Julia, and R have been implemented; their benchmark results are included, although their design and implementation are otherwise outside the scope of the article. Unlike other available C libraries with comparable scope, Randompack is permissively licensed under the MIT license, and it is open source and publicly available through GitHub and conda-forge.

Subject: Applications

Publish: 2026-05-06 16:35:08 UTC


#4 Proximal Projection for Doubly Sparse Regularized Models

Authors: Jia Wei He, R. Ayesha Ali, Gerarda Darlington

Regularization is often used in high-dimensional regression settings to generate a sparse model, which can save tremendous computing resources and identify predictors that are most strongly associated with the response. When the predictors can be represented by a Gaussian graphical model, the structure of the predictor graph can be exploited during regularization. Our proposed model exploits this underlying predictor graph structure by decomposing the estimated coefficient vector into a sum of latent variables that correspond to the sum of each node contribution to the coefficient vector. Regularization is then performed on the latent variables rather than on the coefficient vector directly. We use a penalty function that permits a clear user-defined trade-off between the L1 and L2 penalties and propose a novel proximal projection during optimization. Further, our implementation computes the projection operator for the intersection of selected groups, which conserves more computing resources compared to predictor duplication methods, especially for high-dimensional data. Through simulation, we evaluate the performance of our approach under different graph structures and node counts, and present results on real-world data. Results suggest that our method exhibits stable performance relative to other singly or doubly sparse graphical regression models.

Subjects: Machine Learning, Computation, Methodology

Publish: 2026-05-06 16:31:20 UTC


#5 High-Dimensional Statistics: Reflections on Progress and Open Problems

Authors: Arian Maleki, Subhabrata Sen, Sivaraman Balakrishna, Verena Zuber, Chao Gao, Rishabh Dudeja, Christos Thrampoulidis, Anru Zhang, Weijie Su, Jason M. Klusowski, Po-Ling Loh, Ali Shojaie

Over the past two decades, the field of high-dimensional statistics has experienced substantial progress, driven largely by technological advances that have dramatically reduced the cost and effort for data collection and storage across a broad range of domains, including biology, medicine, astronomy, and the social and environmental sciences. Modern datasets are increasingly complex, often exhibiting rich dependency, heterogeneity, and other features that challenge traditional statistical methods. In response, high-dimensional statistics has evolved to address more sophisticated estimation and inference problems. This evolution has, in turn, fostered deep connections with and contributions to a wide range of research areas, including optimization, concentration of measure, random matrix theory, information theory, and theoretical computer science. Given the rapid pace of recent developments in high-dimensional statistics, our goal is to synthesize representative advances, highlight common themes and open problems, and point to important works that offer entry points into the field.

Subjects: Statistics Theory, Computation, Methodology, Machine Learning

Publish: 2026-05-06 16:11:09 UTC


#6 Heterogeneous Judge-Aware Ranking with Sensitivity, Disagreement, and Confidence

Authors: Shibo Yu, Yingzhou Wang, Yan Chen, Guodong Li, Jin-Hong Du

Pairwise comparisons from multiple judges are central to large language model evaluation and preference modeling, yet standard ranking pipelines often pool judgments into a single score vector, treating systematic judge disagreement as noise. We propose Heterogeneous Judge-Aware (HJA) ranking, a structured multi-judge ranking framework that separates consensus ranking, judge-specific sensitivity to consensus, and residual preference disagreement. HJA thereby treats ranking, judge sensitivity, and structured disagreement as separate inferential targets. We establish conditions under which this decomposition is identifiable and develop an anchored alternating algorithm that preserves the identifying geometry. For confidence quantification, we study a fixed-panel repeated-comparison regime in which the judge panel may remain fixed or modest while information grows through repeated judgments. This yields uncertainty statements for consensus and judge-specific ranking contrasts, sensitivity parameters, pairwise probabilities, and summaries of residual disagreement. Experiments on synthetic and real multi-judge comparison data show that HJA improves recovery, robustness, uncertainty calibration, and near-tie performance relative to pooled and sensitivity-only baselines. The fitted model also provides diagnostics for judge disagreement and model-affinity patterns, giving a statistically grounded framework for ranking under heterogeneous comparative judgments.

Subject: Methodology

Publish: 2026-05-06 16:09:55 UTC


#7 Impossibility of Distribution-Free Predictive Inference for Individual Treatment Effects

Authors: Chongguang Tao, Zheng Zhou, Yuhong Yang

Uncertainty quantification for individual treatment effects (ITEs) is a daunting challenge in causal inference. Motivated by recent advances in conformal prediction, several works aim to construct distribution-free prediction sets for ITEs with desired coverage under standard assumptions such as strong ignorability and overlap. In this paper, we show that such goals are fundamentally unattainable in the presence of continuous covariates. Specifically, we establish finite-sample and asymptotic impossibility results demonstrating that any distribution-free prediction set achieving desired coverage for ITEs must be trivial, in the sense that it has infinite expected length. Our analysis relies on a connection between ITE inference and the hardness of conditional independence testing, and highlights the intrinsic limitations imposed by the missing data nature of causal inference. These results provide a new perspective on existing methods, clarifying that their apparent success necessarily relies on additional structural assumptions beyond standard causal assumptions.

Subject: Methodology

Publish: 2026-05-06 15:48:15 UTC


#8 Hypergraph Generation via Structured Stochastic Diffusion

Author: Christopher Nemeth

Hypergraphs model higher-order interactions, but realistic hypergraph generation remains difficult because incidence, hyperedge-size heterogeneity, and overlap structure are not faithfully captured by pairwise reductions. We propose HEDGE, a generative model defined directly on relaxed incidence matrices via a structured stochastic diffusion. The forward process combines a hypergraph-specific two-sided heat operator with an Ornstein-Uhlenbeck component, preserving structure-aware noising near the data while yielding an explicit Gaussian terminal law. Conditional on an observed hypergraph, this forward process is linear-Gaussian, so conditional means, covariances, scores, and reverse-drift targets are available in closed form. We therefore learn a permutation-equivariant state-only reverse-drift field in incidence space by regressing onto exact conditional targets, and generate samples by simulating a learned reverse-time SDE from the Gaussian base law. We establish exactness in the ideal state-only setting together with finite-horizon stability guarantees, and empirically show improved hypergraph generation quality relative to strong baselines.
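
To see why a linear-Gaussian forward process gives closed-form targets, consider the Ornstein-Uhlenbeck component alone. The sketch below is the textbook isotropic case with assumed scalar parameters $\theta>0$ and $\sigma>0$; it omits the hypergraph-specific heat operator, so it is an illustration rather than the paper's forward process.

```latex
dX_t = -\theta X_t\,dt + \sigma\,dW_t, \qquad X_0 = x_0 .

% Conditional law given the observed starting point:
X_t \mid X_0 = x_0 \;\sim\; \mathcal{N}\!\left(m_t,\; s_t^2 I\right),
\qquad m_t = e^{-\theta t}x_0,
\qquad s_t^2 = \frac{\sigma^2}{2\theta}\left(1 - e^{-2\theta t}\right).

% Conditional score, available without any learning:
\nabla_x \log p_t(x \mid x_0) = -\frac{x - m_t}{s_t^2},
\qquad
X_t \xrightarrow[t\to\infty]{d} \mathcal{N}\!\left(0,\; \tfrac{\sigma^2}{2\theta} I\right).
```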

Subjects: Machine Learning, Computation, Methodology

Publish: 2026-05-06 15:19:20 UTC


#9 Scalable inference of spatial regions and temporal signatures from time series

Authors: Jiayu Weng, Alec Kirkley

Regionalization aims to partition a spatial domain into contiguous regions that share similar characteristics, enabling more effective spatial analysis, policy making, and resource management. Existing approaches for spatial regionalization typically rely on static spatial snapshots rather than evolving time series. Meanwhile, most time series clustering methods ignore spatial structure or enforce spatial continuity through ad hoc regularization, constraining the number of inferred regions a priori either explicitly or implicitly. Utilizing the minimum description length principle from information theory, here we propose an efficient and fully nonparametric framework for the regionalization of spatial time series. Our method jointly infers a spatial partition along with a set of representative time series archetypes ("drivers") that best compress a spatiotemporal dataset, with a runtime log-linear in the number of time series. We demonstrate that this method can accurately recover planted regional structure and drivers in synthetic time series, and can extract meaningful structural regularities in large-scale empirical air quality and vegetation index records. Our method provides a principled and scalable framework for spatially contiguous partitioning, allowing interpretable temporal patterns and homogeneous regions to emerge directly from the data itself.

Subjects: Machine Learning, Social and Information Networks, Physics and Society

Publish: 2026-05-06 15:07:37 UTC


#10 A Tutorial for Evaluating Cure Model Appropriateness

Authors: Geethanjalee Mudunkotuwa, Durbadal Ghosh, Subodh Selukar

In survival analysis, traditional models assume all individuals will eventually experience the event of interest. However, advances in therapeutics have led to multiple clinical contexts with potentially curative therapies, and in these contexts, certain individuals may never experience the event. Statisticians have developed cure models as a methodology to address this challenge. Nonetheless, despite significant statistical advances in cure models, we have seen more limited uptake in biomedical applications, and we hypothesize that this is caused by limited guidance in the appropriate application of cure models. Cure models require specific identifiability conditions for valid parameter estimation, and previous reports have demonstrated significant issues with the inappropriate application of cure models. Existing tutorials for cure models focus on model implementation and either assume or provide only limited guidance on whether cure modeling is appropriate for the given dataset. This tutorial addresses this gap by describing a systematic procedure that integrates clinical judgment, visual inspection of Kaplan-Meier curves, and quantitative evaluation. We provide a worked example using data from a randomized clinical trial in acute myeloid leukemia, and we also summarize findings from a series of other datasets of hematopoietic cell transplantation to suggest broad practical guidance for choosing to apply cure models. By systematically evaluating cure model appropriateness before fitting these models, researchers can achieve more reliable survival analysis and improved clinical decision-making.
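
The Kaplan-Meier inspection step can be made concrete with a small sketch. It computes the product-limit estimate and reports the level of the final plateau, the last observed event time, and how many subjects remain under follow-up after it. The helper names and the idea of summarizing the plateau this way are assumptions for illustration, not the tutorial's procedure.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed event time.
    events[i] is 1 if the event was observed at times[i], 0 if censored."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects who share the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= leaving
    return curve


def plateau_summary(times, events):
    """Hypothetical helper: plateau level, last event time, and the number
    of subjects censored after the last event."""
    curve = kaplan_meier(times, events)
    if not curve:  # no events observed at all
        return 1.0, None, len(times)
    last_event_time, level = curve[-1]
    n_after = sum(1 for t, e in zip(times, events)
                  if t > last_event_time and e == 0)
    return level, last_event_time, n_after


print(plateau_summary([1, 2, 3, 4, 5, 6], [1, 1, 0, 1, 0, 0]))
```

A high plateau supported by only a handful of censored subjects is weak evidence for a cured fraction, which is one reason a numerical summary like this complements, rather than replaces, clinical judgment.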

Subject: Methodology

Publish: 2026-05-06 14:59:41 UTC


#11 Tests for white noise via asymptotically independent U-statistics in high-dimensions

Author: Yuanya Xu

We propose a high-dimensional white noise test that captures serial correlations within and across component series without specifying an alternative model. The test statistic is a U-statistic based on sample autocovariances. Under the null, asymptotic normality is established as $p, T \to \infty$ jointly using martingale difference theory. Our approach imposes no cross-sectional independence assumption, requiring only spectral conditions on $Σ_0$. Theoretically, we link cross-sectional correlations to a graph structure, integrating algebraic and geometric analyses to facilitate the derivation. Simulations confirm reliable size control and satisfactory power across various $(p, T)$ settings.

Subjects: Methodology, Statistics Theory

Publish: 2026-05-06 14:26:50 UTC


#12 Jacobian-Velocity Bounds for Deployment Risk Under Covariate Drift

Author: Jonathan R. Landers

We study long-horizon deployment of a frozen predictor under dynamic covariate shift. A time-domain Poincaré inequality reduces temporal risk volatility to derivative energy, and a Jacobian-velocity theorem identifies directional tangent energy along the deployment path as the governing quantity under explicit along-path regularity and domination assumptions. Under low-rank drift, that quantity reduces to directional Jacobian energy in the drift subspace, motivating drift-aligned tangent regularization (DTR) and a matched monitoring proxy. Rather than smoothing the network isotropically, DTR penalizes sensitivity only along estimated drift directions. We validate the theorem-to-method pipeline in four experiments: a synthetic benchmark for the time-domain inequality, a controlled synthetic comparison against isotropic Jacobian regularization, and two frozen-deployment studies on the UCI Air Quality and Tetouan power-consumption datasets. DTR reduces risk volatility and directional gain in the controlled low-rank regime, beats isotropic smoothing there, and gives validation-selected deployment gains on both real datasets when the Air Quality drift subspace is estimated from target-orthogonal sensor motion. Moderate drift-subspace misspecification is tolerable while orthogonal misspecification largely removes the benefit.

Subject: Machine Learning

Publish: 2026-05-06 13:57:03 UTC


#13 A Convolution Process for Sea Surface Temperature Hot-Spot Identification in the Mediterranean Sea

Authors: Leonardo Marchesin, Alessandra Menafoglio, Piercesare Secchi

Sea surface temperature (SST) is a fundamental determinant of global climate dynamics and economic activity. Reliable projections of future SST patterns depend critically on a rigorous characterization of the underlying spatial random field. In this study, we introduce a novel convolution-based covariance framework tailored to geostatistical domains constrained by physical barriers and influenced by vector-driven flows. By discretizing the continuous marine domain into a directed linear network that preserves the orientation of ocean currents, we construct a moving-average stochastic process whose dynamic is encoded via a Markovian transition-probability matrix on the network's vertices. The induced covariance structure emerges as a weighted combination of a spatial kernel and flow-dependent weights, giving rise to a complex estimation problem. To stabilize inference, we propose a penalized estimator that regularizes covariance parameters while enforcing consistency with known hydrodynamic properties. We then embed this covariance model into a Monte Carlo simulation framework to refine RCP-based SST projections and to identify thermal 'hot spots' of heightened ecological risk. Our approach delivers a statistically principled framework that prevents physical inconsistencies -- such as correlations across land barriers -- providing a robust basis for quantifying uncertainty in future SST forecasts and for guiding targeted environmental assessments.

Subjects: Methodology, Applications

Publish: 2026-05-06 13:49:12 UTC


#14 PAIR-CI: Calibrated Conditional Independence Testing for Causal Discovery with Incomplete Data

Authors: Thomas S. Robinson, Ranjit Lall

The standard constraint-based paradigm for causal discovery with incomplete data -- impute first, test second -- is frequently miscalibrated: any consistent conditional independence (CI) test rejects a true null with probability approaching 1 when imputation error induces spurious conditional dependence. We introduce PAIR-CI, a nonparametric CI test that restores calibration by integrating multiple imputation directly into the inferential procedure via a paired permutation design. PAIR-CI compares cross-validated models that include and exclude the candidate variable while receiving the same imputed conditioning set, forcing imputation error to cancel in their loss difference rather than contaminate the test statistic. A provably consistent variance estimator jointly accounts for uncertainty arising from cross-validation and multiple imputation -- to our knowledge, the first formal unification of these two inferential frameworks. In simulations, existing imputation-based CI tests exhibit false positive rates of 28--45% when data are missing not at random (MNAR), whereas PAIR-CI averages below the nominal 5% level across data-generating processes and missingness mechanisms. These gains are largest in nonlinear settings and grow with causal graph size: when integrated into the PC algorithm, PAIR-CI reduces structural Hamming distance by 8% on 10-variable nonlinear graphs, 15% on 30-variable equivalents, and up to 44% on the 56-variable HAILFINDER network, with stable performance in all settings.

Subjects: Methodology, Machine Learning

Publish: 2026-05-06 12:34:37 UTC


#15 Data anonymization in the presence of outliers via invariant coordinate selection

Authors: Katariina Perkonoja, Joni Virta

Protecting confidential data while preserving utility is particularly challenging when data sets contain outlying observations. Existing latent space anonymization methods, such as spectral anonymization (SA), rely on principal component analysis (PCA) and may therefore be vulnerable to contamination. We investigate anonymization in the presence of outliers and propose ICSA, a robust alternative to SA based on invariant coordinate selection (ICS). By replacing the PCA transformation with ICS, the robustness of the anonymization procedure can be regulated through the choice of scatter matrices. Alongside the methodological development, we derive a theoretical result showing that SA fails under sufficiently influential outliers. To assess the practical implications of this result, we compare the privacy-utility trade-off of ICSA and SA through simulation experiments under varying contamination settings and outlier severities. Our findings indicate that implementations of ICSA based on robust scatter matrices achieve stronger privacy protection than SA, while typically maintaining comparable, and in some cases improved, utility. We further examine the empirical performance of the proposed method using a benchmark clinical data set, where ICSA demonstrates superior overall privacy-utility efficiency relative to SA. These results suggest that explicitly accounting for outliers can materially improve anonymization performance and that robust latent space transformations offer a promising direction for privacy-preserving statistical data release.

Subjects: Methodology, Cryptography and Security

Publish: 2026-05-06 12:29:50 UTC


#16 Multiscale Euclidean Network Trajectories: Second-Moment Geometry, Attribution, and Change Points

Authors: Haruka Ezoe, Ryohei Hisano

A central challenge in dynamic network analysis is to represent temporal evolution in a way that is both geometrically meaningful and statistically identifiable. One approach embeds a sequence of network snapshots as trajectories in a Euclidean space and relates these trajectories to node embeddings. In multilayer and unfolded spectral constructions, however, node embeddings and their underlying latent positions are identifiable only up to general linear transformations. Although this ambiguity preserves edge probabilities, it can distort geometry and invalidate distance-based temporal comparisons at both the trajectory and node levels. We develop Multiscale Euclidean Network Trajectories (MENT), a framework for multiscale temporal trajectories based on second-moment geometry. By imposing an isotropic normalization on the anchor latent positions, we reduce the relevant ambiguity to orthogonal transformations and prevent distortion of the second-moment geometry. In this canonical representation, we define a trace variation distance and mode-wise variation distances along orthogonal directions, and use multidimensional scaling to obtain low-dimensional trajectories of time points at both global and mode-wise levels. The resulting trajectories support interpretation and inference. They admit mode-wise decompositions, support attribution of global and mode-wise temporal changes to nodes, and enable change point detection through 1D trajectories. We prove consistency of the proposed unfolded spectral embedding and of the induced temporal trajectories. Experiments on two synthetic and two real dynamic networks illustrate stable and interpretable recovery of temporal structure and show strong performance against existing change point detection baselines.

Subjects: Machine Learning, Statistics Theory

Publish: 2026-05-06 07:39:54 UTC


#17 Transversality and Geometric Regularisation in Distributional Statistical Models

Author: R. Labouriau

The distributional statistical framework replaces classical probability densities by distribution-kernel pairs $(T, \varphi)$, where $T$ is a tempered distribution and $\varphi$ is a rapidly decaying kernel. We develop the thesis that the kernel acts as a geometric regulariser, placing parametric statistical models in generic (transversal) position relative to degeneracy loci encoding non-identifiability, singular information, moment indeterminacy, and representation failure. Using the transversality theorems of Whitney, Thom, and Mather, we prove a finite-dimensional weak transversality theorem: for a generic kernel in any sufficiently rich family, the kernel-induced feature map avoids degeneracy strata of sufficiently high codimension. We establish verifiable conditions -- formulated as rank conditions on the Jacobian of the joint feature map -- under which the transversality hypothesis can be checked, and verify them for location families, the log-normal, Stein discrepancies, and graphical models. The present results apply to parametric models; extensions to semiparametric and nonparametric settings are discussed. The degeneracy classification includes representation degeneracy (Type 0) for models without closed-form densities and higher-order instabilities (Type IV) in non-chordal graphical models. Identifiability, robustness, moment determinacy, Fisher information regularity, Stein discrepancy, inferential separation, and the Behrens-Fisher problem all admit a unified geometric interpretation as transversality conditions on the feature map. This paper serves as a geometric companion to a series of papers developing the distributional framework.

Subjects: Statistics Theory, Differential Geometry, Methodology

Publish: 2026-05-06 06:24:41 UTC


#18 Augmented transfer regression learning for completely missing covariates

Authors: Huali Zhao, Tianying Wang

Large-scale population-level datasets, such as the UK Biobank and the All of Us Research Program, often lack covariates needed for a specific analysis, such as genetic or lifestyle measures, while related studies measure them. This creates a cross-population missing data problem in which covariates are completely unobserved in the target population, rather than partially missing within one dataset. We propose an augmented transfer regression learning method for this setting. The key identifying condition is a sub-population shift assumption: the joint distribution of the outcome and observed covariates may differ across source and target populations, but the conditional distribution of the missing covariates given observed variables is invariant. We combine importance-weighted estimating equations with imputation terms for first- and second-order moments of the missing covariates. The resulting estimator is doubly robust, remaining consistent if either the density ratio model or both imputation models are correctly specified. It is $n^{1/2}$-consistent and asymptotically normal, and attains the semiparametric efficiency bound when both nuisance models are correctly specified.

Subjects: Methodology, Statistics Theory, Machine Learning

Publish: 2026-05-06 03:48:28 UTC


#19 Penalized KLIC Model Selection for the Generalized Method of Moments in Longitudinal Data with Time-Dependent Covariates

Authors: Hasan Mahmud, Muia Mathias Nthiani, Hamadou Mous-Abou, Ramezani Niloofar

Model selection plays an important role in longitudinal data analysis, especially when models are estimated using the generalized method of moments (GMM) in the presence of time-dependent covariates. In this setting, the number of valid moment conditions can grow quickly and may lead to over-parameterized models. The Kullback-Leibler Information Criterion (KLIC) has been proposed as a model-selection tool for this framework; however, the original KLIC criterion may favor overly complex models when the number of parameters or valid moment conditions increases. To address this limitation, this study proposes two penalized versions of KLIC that incorporate penalties based on both the number of model parameters and the number of valid moment conditions. The proposed criteria are referred to as the Moment-Parameter Product Penalty KLIC (MPPP-KLIC) and the Logarithmic Penalty KLIC (LP-KLIC). These criteria provide a theoretically motivated mechanism for balancing model fit and model complexity in GMM-based longitudinal models. Through an extensive simulation study involving both binary and continuous response settings, the proposed criteria are shown to improve the ability of KLIC to distinguish among competing models and to reduce the selection of over-parameterized models. The performance of the proposed methods is further illustrated using the Filipino Child Morbidity dataset, a longitudinal study of child health in the Philippines. The results show that the proposed penalized criteria provide stable and interpretable model rankings and consistently identify age as the most important predictor of child morbidity. Overall, the proposed penalized KLIC criteria offer practical and theoretically grounded tools for model selection in GMM-based longitudinal data analysis with time-dependent covariates.

Subjects: Methodology, Applications, Computation

Publish: 2026-05-06 03:33:08 UTC


#20 HIMCE: High-dimensional multiple imputation via covariance-mode updating for neuroimaging and spatiotemporal blocks

Authors: Hsin-Hsiung Huang, Stef van Buuren

High-dimensional neuroimaging and spatiotemporal blocks often contain structured missingness from acquisition artifacts, preprocessing failures, and sensor dropout. Multiple imputation propagates uncertainty, but fully conditional specification methods such as multivariate imputation by chained equations (MICE) can be slow or unstable when block dimension is large and correlations are strong. A multivariate normal (MVN) working model provides a coherent posterior predictive target and an exact data augmentation sampler, but repeated covariance sampling and matrix factorizations become costly in large dimensions. We propose High-dimensional Imputation via covariance Mode and Chained Equations (HIMCE), a hybrid multiple-imputation procedure for continuous blocks. Relative to exact MVN data augmentation, HIMCE preserves the Gaussian conditional imputation law and propagates mean-parameter uncertainty through stochastic coefficient or local-ridge draws. In high-dimensional blocks, it approximates covariance uncertainty through covariance-mode updating, optionally with a scalar bridge; in small blocks, it can restore exact covariance uncertainty through a conditional inverse-Wishart refresh. We record the exact Bayesian reference sampler and prove fixed-dimensional posterior consistency and asymptotic equivalence of mode plug-in prediction in total variation. We also develop diagnostics based on randomized rank-cell probability integral transform (PIT), PIT-consistent empirical coverage, and marginal distribution overlays. In the primary spatial benchmark, HIMCE improves posterior-mean error relative to HIMA and screened MICE, runs at HIMA-like speed and below half the MICE runtime, and improves interval coverage over HIMA, although MICE remains better calibrated. A repeated low-dimensional NHANES illustration shows improved coverage with competitive point prediction.

Subjects: Methodology , Computation

Publish: 2026-05-06 03:12:12 UTC
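
The "Gaussian conditional imputation law" mentioned in the abstract above can be illustrated with the textbook conditional draw under an MVN working model. This is a minimal sketch, not the HIMCE procedure: the function name `draw_conditional_gaussian` is hypothetical, and `mu` and `sigma` are assumed to be given (for example, a current mean and covariance estimate), whereas the paper's covariance-mode updating and coefficient draws are not reproduced here.

```python
import numpy as np


def draw_conditional_gaussian(x, mu, sigma, rng):
    """Fill the NaN entries of x with one draw from N(mu, sigma)
    conditioned on the observed entries of x.

    Hypothetical helper for illustration; mu and sigma are assumed known.
    """
    miss = np.isnan(x)
    if not miss.any():
        return x.copy()
    obs = ~miss

    # Partition the covariance into observed / missing blocks.
    s_oo = sigma[np.ix_(obs, obs)]
    s_mo = sigma[np.ix_(miss, obs)]
    s_mm = sigma[np.ix_(miss, miss)]

    # Regression coefficients of missing on observed: S_mo S_oo^{-1}.
    coef = np.linalg.solve(s_oo, s_mo.T).T

    # Conditional mean and covariance of the missing block.
    cond_mean = mu[miss] + coef @ (x[obs] - mu[obs])
    cond_cov = s_mm - coef @ s_mo.T
    cond_cov = (cond_cov + cond_cov.T) / 2  # remove round-off asymmetry

    out = x.copy()
    out[miss] = rng.multivariate_normal(cond_mean, cond_cov)
    return out
```

Calling the function repeatedly with fresh parameter draws for `mu` and `sigma` is what would turn this single imputation into multiple imputation; how those draws are produced is exactly where the methods in the abstract differ.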


#21 Causal discovery under mean independence and linearity [PDF] [Copy] [Kimi] [REL]

Authors: Geert Mesters, Alvaro Ribot, Anna Seigal, Piotr Zwiernik

Causal discovery methods such as LiNGAM identify causal structure from observational data by assuming mutually independent disturbances. This assumption is fragile: shared volatility, common scale effects, or other forms of dependence can cause the methods to recover the wrong causal order, even with infinite data. We introduce the Linear Mean-Independent Acyclic Model (LiMIAM), which replaces full independence with weaker one-sided mean-independence restrictions on the disturbances. Under finite-order consequences of these restrictions, source nodes are generically identifiable, and hence a compatible causal order can be recovered recursively. Our proof is constructive and leads to DirectLiMIAM, a sequential residual-based algorithm for causal discovery under dependent noise. In simulations with mean-independent but dependent disturbances, DirectLiMIAM outperforms LiNGAM methods. A large-scale empirical application to the oil market highlights the implausibility of the independence assumption and the ability of DirectLiMIAM to recover a realistic causal ordering, from policy to production and from prices to inflation.

Subjects: Methodology , Machine Learning , Statistics Theory , Machine Learning

Publish: 2026-05-06 01:16:46 UTC


#22 A Zero-Inflated Beta Mixture Model for Marginal Mediation Analysis with Compositional Microbiome Mediators [PDF] [Copy] [Kimi] [REL]

Authors: Seungjun Ahn, Quran Wu, Alicia Yang, Zhigang Li

The role of the microbiome in disease pathogenesis is an emerging field with strong evidence suggesting that dysbiosis is associated with precancerous and cancerous states. Microbiome data present substantial challenges for causal mediation analysis due to sparsity, compositional constraints, and latent heterogeneity. To address these issues, we propose a zero-inflated beta mixture (ZIBM) method for mediation analysis with compositional microbiome mediators. The proposed method accommodates excess zeros through a zero-inflation component and captures heterogeneity in non-zero relative abundances using a beta mixture distribution. Within the potential-outcomes framework, the ZIBM provides estimates of marginal microbiome-mediated causal effects, and model parameters are estimated using an expectation-maximization algorithm. Simulation studies demonstrate that the ZIBM yields more accurate estimation and reliable inference under conditions commonly observed in microbiome data, compared with existing approaches. An application to a real microbiome study further illustrates its practical utility. These results indicate that the proposed method provides a more flexible and robust statistical framework for mediation analysis involving compositional microbiome data.

Subjects: Methodology , Quantitative Methods , Applications

Publish: 2026-05-06 00:35:25 UTC
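
To make the two ingredients named in the abstract above concrete (a zero-inflation component plus a beta mixture for the non-zero relative abundances), the following sketch samples from a generic zero-inflated beta mixture. The parameterization is an assumption made for illustration: `pi_zero`, `weights`, `alphas`, and `betas` are hypothetical names, and the paper's own parameterization, covariate dependence, and EM estimation are not shown.

```python
import numpy as np


def sample_zibm(n, pi_zero, weights, alphas, betas, rng):
    """Draw n values from a zero-inflated beta mixture (illustrative form).

    With probability pi_zero the value is exactly 0; otherwise a mixture
    component k is chosen with probability weights[k] and the value is
    drawn from Beta(alphas[k], betas[k]).
    """
    weights = np.asarray(weights, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    betas = np.asarray(betas, dtype=float)

    # Structural zeros from the zero-inflation component.
    is_zero = rng.random(n) < pi_zero

    # Latent component labels for the beta mixture.
    comp = rng.choice(len(weights), size=n, p=weights / weights.sum())

    # Non-zero relative abundances from the selected beta components.
    draws = rng.beta(alphas[comp], betas[comp])
    return np.where(is_zero, 0.0, draws)
```

For instance, `sample_zibm(1000, 0.3, [0.6, 0.4], [2.0, 5.0], [8.0, 2.0], np.random.default_rng(1))` gives a sample with roughly 30% exact zeros and a bimodal non-zero part.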


#23 Perturbation is All You Need for Extrapolating Language Models [PDF] [Copy] [Kimi1] [REL]

Authors: Zetai Cen, Jin Zhu, Xinwei Shen, Chengchun Shi

We introduce a simple yet powerful framework for training large language models. In contrast to the standard autoregressive next-token prediction based on an exact prefix, we propose a perturbation-based procedure that first transforms the prefix into a semantic neighbor and then conditions on this perturbed variant for next-token prediction. This yields a hierarchical model with a pre-post-additive noise structure. Within this framework, we develop a rigorous theory of extrapolability, namely, the capacity of a model class to make reliable predictions for token sequences that lie outside the empirical support of the training corpus. We evaluate the finite-sample performance of the proposed procedure using both synthetic and real-world language data. Results show that the proposed method consistently improves out-of-support prediction while maintaining competitive in-support performance, demonstrating that perturbation offers a practical route to language modeling.

Subjects: Machine Learning , Machine Learning , Statistics Theory

Publish: 2026-05-05 23:03:33 UTC


#24 The Threshold Breakdown Point [PDF] [Copy] [Kimi] [REL]

Authors: Tianjun Ke, Marco Avella Medina

We introduce a novel approach to finite sample robustness that avoids the pessimism of traditional breakdown analyses. We define the threshold breakdown point, the smallest contamination fraction needed to induce a prescribed deviation, and the finite sample m-sensitivity, the worst-case deviation that an estimator can incur after m observations are contaminated. We derive these measures for commonly used M-estimators, their standard errors and related test statistics. This allows us to extend the decision breakdown point of Zhang (1996) to obtain general breakdown characterizations for hypothesis testing, and show how these notions correspond to finite sample counterparts of the power and level breakdown functions of He, Simpson and Portnoy (1990). We complement our work with an inferential framework for the threshold breakdown and m-sensitivity that yields consistency and asymptotic normality results, as well as a valid multiplier bootstrap for uncertainty quantification. We illustrate the practical utility of our methods in various numerical examples and an application to a two sample testing problem for a blood pressure dataset.

Subject: Statistics Theory

Publish: 2026-05-05 21:36:45 UTC
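
The two quantities defined in the abstract above can be computed exactly for the sample median, which gives a feel for how they differ from the classical breakdown point. This is a sketch under stated assumptions, not the paper's method: contamination is taken to mean arbitrary replacement of m observations, the deviation is the absolute shift of the estimate, and the threshold is met when the worst-case deviation is at least `delta`. The function names are hypothetical.

```python
import numpy as np


def median_m_sensitivity(x, m):
    """Worst-case absolute shift of the sample median when m observations
    are replaced by arbitrary values (assumed contamination model)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if m <= 0:
        return 0.0
    if m >= n:
        return float("inf")
    base = np.median(x)
    # Largest upward shift: replace the m smallest points by +inf.
    up = np.concatenate([x[m:], np.full(m, np.inf)])
    # Largest downward shift: replace the m largest points by -inf.
    down = np.concatenate([np.full(m, -np.inf), x[: n - m]])
    return float(max(np.median(up) - base, base - np.median(down)))


def median_threshold_breakdown(x, delta):
    """Smallest contamination fraction m/n whose worst-case shift of the
    median is at least delta (delta > 0 assumed)."""
    n = len(x)
    for m in range(n + 1):
        if median_m_sensitivity(x, m) >= delta:
            return m / n
    return 1.0
```

On the sample `[1, 2, 3, 4, 5]`, a prescribed deviation of 1.5 is reached after contaminating 2 of 5 points (fraction 0.4), whereas an unbounded deviation requires 3 of 5 (fraction 0.6), the classical breakdown regime.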


#25 Thinned Quantile Shares are Universally Feasible [PDF] [Copy] [Kimi] [REL]

Authors: Vishesh Jain, Clayton Mizgerd, Shyam Ravichandran

Quantile shares, introduced by Babichenko, Feldman, Holzman, and Narayan [STOC 2024], offer an ordinal, self-maximizing, and interpretable benchmark for fair division of indivisible goods, but their universal feasibility is known only conditional on the rainbow Erdős matching conjecture (EMC). Specifically, Babichenko et al. showed that assuming the rainbow EMC in the near-perfect matching regime, the $(1/2e)$-quantile share is universally feasible. In contrast, a simple argument shows that the $q$-quantile share can be infeasible for any $q > 1/e$. We introduce a one-parameter refinement of quantile shares, the $c$-thinned quantile share, obtained by thinning the inclusion probability in the random benchmark bundle by a factor of $c$ for a fixed constant $c\in(0,1]$. Our main result is that there exists a universal constant $c >0$ for which the $c$-thinned $e^{-c}$-quantile share is unconditionally universally feasible; this is best possible in the sense that for any $c \in (0,1]$, the $c$-thinned $q$-quantile share can be infeasible for any $q > e^{-c}$. Prior to this work, the only nontrivial share known to be universally feasible was Feige's residual maximin share. The thinning viewpoint also lets us remove the factor-two loss in the conditional result for the original quantile share: assuming the rainbow EMC, the $(1/e)$-quantile share is universally feasible.

Subjects: Statistics Theory , Discrete Mathematics , Computer Science and Game Theory , Combinatorics

Publish: 2026-05-05 21:07:23 UTC