Statistics

2026-05-15 | Total: 65

#1 How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

Authors: Leena Chennuru Vankadara, Moritz Haas, Luke Hayward, Sebastian Bordt, Alessandro Breccia

Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$, and depth $L$ to ensure both stability and optimal performance at scale. We take a principled step toward resolving this gap by analyzing three different scaling regimes: (I) co-scaling $N\asymp N_e$, (II) co-scaling $N\asymp M\asymp K$, and (III) full proportional scaling of $N, N_e, M$, and $K$. For each regime, we develop a novel Dynamical Mean Field Theory (DMFT) description of the limiting training dynamics of MoEs that provides a formal foundation for our analysis. Within this framework, we derive the unique parameterization for SGD and Adam satisfying all maximal-update ($\mu$) desiderata. We then show that the resulting $\mu$P prescription does not reliably induce monotonic improvement with scale or robust learning-rate transfer. We trace these pathologies to scale-dependent observables in the aggregation dynamics, which motivates a refined set of desiderata that we term maximal scale stability. Guided by this principle, we derive a Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam in all three scaling regimes, and characterize the corresponding limiting dynamics, qualitatively distinct from the $\mu$P limit, through a separate DMFT analysis. Experiments verify that MSSP robustly recovers learning rate transfer and monotonic improvement with scale across regimes. Combined with existing depth-scaling theory, these results provide a complete scaling prescription for MoE architectures as a function of width, depth, expert width, and number of experts.

Subjects: Machine Learning , Machine Learning

Publish: 2026-05-13 23:32:00 UTC


#2 Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment

Authors: Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim, Jeremy C. Weiss

Reconstructing precise clinical timelines is essential for modeling patient trajectories and forecasting risk in complex, heterogeneous conditions like sepsis. While unstructured clinical narratives offer semantically rich and contextually complete descriptions of a patient's course, they often lack temporal precision and contain ambiguous event timing. Conversely, structured electronic health record (EHR) data provides precise temporal anchors but misses a substantial portion of clinically meaningful events. We introduce a retrieval-augmented multimodal alignment framework that bridges this gap to improve the temporal precision of absolute clinical timelines extracted from text. Our approach formulates timeline reconstruction as a graph-based multistep process: it first extracts central anchor events from narratives to build an initial temporal scaffold, places non-central events relative to this backbone, and then calibrates the timeline using retrieved structured EHR rows as external temporal evidence. Evaluated using instruction-tuned large language models on the i2m4 benchmark spanning MIMIC-III and MIMIC-IV, our multimodal pipeline consistently improves absolute timestamp accuracy (AULTC) and improves temporal concordance across nearly all evaluated models over unimodal text-only reconstruction, without compromising event match rates. Furthermore, our empirical gap analysis reveals that 34.8% of text-derived events are entirely absent from tabular records, demonstrating that aligning these modalities can produce a more temporally faithful and clinically informative reconstruction of patient trajectories than either source alone.

Subjects: Computation and Language , Artificial Intelligence , Machine Learning , Machine Learning

Publish: 2026-05-14 17:55:27 UTC


#3 Logging Policy Design for Off-Policy Evaluation

Authors: Connor Douglas, Joel Persson, Foster Provost

Off-policy evaluation (OPE) estimates the value of a target treatment policy (e.g., a recommender system) using data collected by a different logging policy. It enables high-stakes experimentation without live deployment, yet in practice accuracy depends heavily on the logging policy used to collect data for computing the estimate. We study how to design logging policies that minimize OPE error for given target policies. We characterize a fundamental reward-coverage tradeoff: concentrating probability mass on high-reward actions reduces variance but risks missing signal on actions the target policy may take. We propose a unifying framework for logging policy design and derive optimal policies in canonical informational regimes where the target policy and reward distribution are (i) known, (ii) unknown, and (iii) partially known through priors or noisy estimates at logging time. Our results provide actionable guidance for firms choosing among multiple candidate recommendation systems. We demonstrate the importance of treatment selection when gathering data for OPE, and describe theoretically optimal approaches when this is a firm's primary objective. We also distill practical design principles for selecting logging policies when operational constraints prevent implementing the theoretical optimum.
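To make the dependence on the logging policy concrete, here is a minimal sketch that assumes the standard inverse-propensity-scoring (IPS) estimator, which the abstract does not name. The three-action bandit, its reward means, and the helper names `ips_estimate` and `simulate_ips` are hypothetical illustrations, not the paper's setup.

```python
import random


def ips_estimate(logged, target_probs):
    """Inverse-propensity-scored estimate of the target policy's value.

    logged: list of (action, reward, logging_prob) tuples.
    target_probs: dict mapping action -> target-policy probability.
    """
    total = 0.0
    for action, reward, logging_prob in logged:
        # Reweight each logged reward by how much more (or less) often
        # the target policy would have taken the logged action.
        total += target_probs[action] / logging_prob * reward
    return total / len(logged)


def simulate_ips(logging_probs, target_probs, mean_rewards, n, seed):
    """Log n Bernoulli-reward interactions under logging_probs, then run IPS."""
    rng = random.Random(seed)
    actions = list(logging_probs)
    weights = [logging_probs[a] for a in actions]
    logged = []
    for _ in range(n):
        action = rng.choices(actions, weights=weights)[0]
        reward = 1.0 if rng.random() < mean_rewards[action] else 0.0
        logged.append((action, reward, logging_probs[action]))
    return ips_estimate(logged, target_probs)


# Hypothetical bandit: the target policy mostly plays action "c".
mean_rewards = {"a": 0.2, "b": 0.5, "c": 0.8}
target = {"a": 0.1, "b": 0.1, "c": 0.8}
true_value = sum(target[a] * mean_rewards[a] for a in target)

uniform_logging = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
sparse_logging = {"a": 0.49, "b": 0.49, "c": 0.02}  # rarely covers "c"

print("true value:", true_value)
print("uniform logging:", simulate_ips(uniform_logging, target, mean_rewards, 2000, seed=1))
print("sparse logging:", simulate_ips(sparse_logging, target, mean_rewards, 2000, seed=1))
```

Both estimates are unbiased, but the second logging policy places little mass on the action the target policy favors, so its importance weights are large and the estimate is noisier across seeds. This is only the coverage side of the tradeoff the abstract describes.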

Subjects: Machine Learning , Artificial Intelligence , Information Retrieval , Machine Learning , Methodology

Publish: 2026-05-14 17:25:19 UTC


#4 A Mutual Information Lower Bound for Multimodal Regression Active Learning

Authors: Leonardo Ferreira Guilhoto, Akshat Kaushal, Paris Perdikaris

Active learning for continuous regression has lacked an acquisition function that targets epistemic uncertainty when the predictive distribution is multimodal: variance misses modal disagreement, and information-theoretic targets like BALD are designed for discrete outputs. We introduce a Two-Index framework that makes this separation explicit: one stochastic index selects among competing model hypotheses (epistemic source), while a second governs within-hypothesis randomness (aleatoric source). An entropy decomposition within the framework identifies the mutual information between the output and the epistemic index as a principled acquisition objective, and we prove this quantity vanishes as the model is trained on growing datasets, confirming that it captures exactly the uncertainty data can resolve. Because this mutual information is intractable for continuous outputs, we derive the Mutual Information Lower Bound (MI-LB) acquisition function, a closed-form approximation for Mixture Density Network ensembles. On benchmarks featuring multimodal systems, MI-LB matches or beats every baseline evaluated and is the only method to do so consistently -- geometric and Fisher-based baselines compete only when the input space already encodes the multimodality, and collapse otherwise.
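One standard way to write the kind of entropy decomposition the abstract refers to is shown here. The notation is assumed rather than taken from the paper: $Y$ is the output at input $x$, $E$ is the epistemic index, $H$ is differential entropy, and $I$ is mutual information.

$$
H(Y \mid x) \;=\; \underbrace{I(Y; E \mid x)}_{\text{epistemic}} \;+\; \underbrace{H(Y \mid E, x)}_{\text{aleatoric}},
\qquad
H(Y \mid E, x) \;=\; \mathbb{E}_{e \sim p(E)}\big[\, H(Y \mid E = e, x) \,\big].
$$

Under this reading, the acquisition objective is the first term. It is zero exactly when the hypotheses indexed by $E$ agree on the predictive distribution at $x$, even if that distribution is itself multimodal.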

Subjects: Machine Learning , Computational Engineering, Finance, and Science , Information Theory , Machine Learning

Publish: 2026-05-14 14:50:47 UTC


#5 Training-Free Generative Sampling via Moment-Matched Score Smoothing

Authors: Zhenyu Yao, Daniel Paulin

Diffusion models generate samples by denoising along the score of a perturbed target distribution. In practice, one trains a neural diffusion model, which is computationally expensive. Recent work suggests that score matching implicitly smooths the empirical score, and that this smoothing bias promotes generalization by capturing low-dimensional data geometry. We propose moment-matched score-smoothed overdamped Langevin dynamics (MM-SOLD), a training-free interacting particle sampler that enforces the target moments throughout the sampling trajectory. We prove that, in the large-particle limit, the empirical particle density converges to a deterministic limit whose one-particle stationary marginal is a Gibbs--Boltzmann density obtained by exponentially tilting a naive score-smoothed diffusion target. The mean and covariance of this distribution agree with the empirical moments of the training data. Experiments on 2D distributions and latent-space image generation show that MM-SOLD enables fast, robust, training-free sampling on CPUs, with sample fidelity and diversity competitive with neural diffusion baselines.

Subjects: Machine Learning , Machine Learning

Publish: 2026-05-14 02:20:36 UTC


#6 Finite Sample Bounds for Learning with Score Matching

Authors: Devin Smedira, Abhijith Jayakumar, Sidhant Misra, Marc Vuffray, Andrey Y. Lokhov

Learning of continuous exponential family distributions with unbounded support remains an important area of research for both theory and applications in high-dimensional statistics. In recent years, score matching has become a widely used method for learning exponential families with continuous variables due to its computational ease when compared against maximum likelihood estimation. However, theoretical understanding of the statistical properties of score matching is still lacking. In this work, we provide a non-asymptotic sample complexity analysis for learning the structure of exponential families of polynomials with score matching. The derived sample bounds show a polynomial dependence on the model dimension. These bounds are the first of their kind, as all prior work has shown only asymptotic bounds on the sample complexity.

Subjects: Machine Learning , Data Structures and Algorithms , Machine Learning

Publish: 2026-05-13 22:48:18 UTC


#7 Covariance-aware sampling for Diffusion Models

Authors: Andrea Schioppa, Tim Salimans

We present a covariance-aware sampler that improves the quality of pixel-space Diffusion Model (DM) sampling in the few-step regime. We hypothesize that in the few-step regime samplers fail because they rely solely on the predicted mean of the reverse distribution, while our solution explicitly models the reverse-process covariance. Our method combines Tweedie's formula to estimate the covariance with an efficient, structured Fourier-space decomposition of the covariance matrix. Implemented as an extension of DDIM, our method requires only a minimal overhead: one extra Jacobian-Vector Product (JVP) per step. We demonstrate that for pixel-based DMs, our method consistently produces superior samples compared to state-of-the-art second order samplers (Heun, DPM-Solver++) and the recent aDDIM sampler, at an identical number of function evaluations (NFE).
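For reference, Tweedie's formula and its second-order analogue can be written as follows. This assumes a variance-exploding noising $x_t = x_0 + \sigma_t \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$ and marginal density $p_t$; the paper's own parameterization may differ.

$$
\mathbb{E}[x_0 \mid x_t] \;=\; x_t + \sigma_t^2 \,\nabla_{x_t} \log p_t(x_t),
\qquad
\mathrm{Cov}[x_0 \mid x_t] \;=\; \sigma_t^2 \left( I + \sigma_t^2 \,\nabla_{x_t}^2 \log p_t(x_t) \right) \;=\; \sigma_t^2 \,\frac{\partial\, \mathbb{E}[x_0 \mid x_t]}{\partial x_t}.
$$

The last equality says that multiplying the posterior covariance by a vector amounts to one Jacobian-vector product of the denoiser $x_t \mapsto \mathbb{E}[x_0 \mid x_t]$, scaled by $\sigma_t^2$. That is consistent with the single extra JVP per step mentioned in the abstract.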

Subjects: Machine Learning , Computer Vision and Pattern Recognition , Machine Learning

Publish: 2026-05-13 07:46:06 UTC


#8 TabPFN-3: Technical Report

Authors: Léo Grinsztajn, Klemens Flöge, Oscar Key, Felix Birkel, Philipp Jund, Brendan Roof, Mihir Manium, Shi Bin Hoo, Magnus Bühler, Anurag Garg, Dominik Safaric, Jake Robertson, Benjamin Jäger, Simone Alessi, Adrian Hayler, Vladyslav Moroshan, Lennart Purucker, Philipp Singer, Alan Arazi, Julien Siems, Jan Hendrik Metzen, Georg Grab, Nick Erickson, Siyuan Guo, Eliott Kalfon, Simon Bing, David Salinas, Clara Cornu, Lilly Charlotte Wehrhahn, Diana Kriuchkova, Kursat Kaya, Lydia Sidhoum, Marie Salmon, Jerry Chen, Madelon Hulsebos, Yann LeCun, Samuel Müller, Bernhard Schölkopf, Sauraj Gambhir, Noah Hollmann, Frank Hutter et al. (12 additional authors not shown)

Tabular data underpins most high-value prediction problems in science and industry, and TabPFN has driven the foundation model revolution for this modality. Designed with feedback from our users, TabPFN-3 builds on this foundation to scale state-of-the-art performance to datasets with 1M training rows and substantially reduce training and inference time. Pretrained exclusively on synthetic data from our prior, TabPFN-3 dramatically pushes the frontier of tabular prediction and brings substantial gains on time series, relational, and tabular-text data. On the standard tabular benchmark TabArena, a forward pass of TabPFN-3 outperforms all other models, including tuned and ensembled baselines, by a significant margin, and pareto-dominates the speed/performance frontier. On more diverse datasets, TabPFN-3 ranks first on datasets with many classes, and beats 8-hour-tuned gradient-boosted-tree baselines on datasets up to 1M training rows and 200 features. TabPFN-3 introduces test-time compute scaling to tabular foundation models. Our API offering TabPFN-3-Plus (Thinking) exploits this to beat all non-TabPFN models by over 200 Elo on TabArena, rising to 420 Elo on the largest data subset, and outperforms AutoGluon 1.5 extreme while being 10x faster, without using LLMs, real data, internet search or any other model besides TabPFN. TabPFN-3 extends the capabilities of our models, enabling SOTA prediction on relational data (new SOTA foundation model on RelBenchV1) and tabular-text data (SOTA on TabSTAR via TabPFN-3-Plus); and improves existing integrations: a specialized checkpoint, TabPFN-TS-3, ranks 2nd on the time-series benchmark fev-bench, and SHAP-value computation is up to 120x faster. TabPFN-3 achieves this performance while being up to 20x faster than TabPFN-2.5. In addition, a reduced KV cache and row-chunking scale to 1M rows on one H100 with fast inference speed.

Subjects: Machine Learning , Machine Learning

Publish: 2026-05-13 18:01:43 UTC


#9 Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients

Authors: Matias Alvo, Daniel Russo, Yash Kanoria

We study reinforcement learning in hybrid discrete-continuous action spaces, such as settings where the discrete component selects a regime (or index) and the continuous component optimizes within it -- a structure common in robotics, control, and operations problems. Standard model-free policy gradient methods rely on score-function (SF) estimators and suffer from severe credit-assignment issues in high-dimensional settings, leading to poor gradient quality. On the other hand, differentiable simulation largely sidesteps these issues by backpropagating through a simulator, but the presence of discrete actions or non-smooth dynamics yields biased or uninformative gradients. To address this, we propose Hybrid Policy Optimization (HPO), which backpropagates through the simulator wherever smoothness permits, using a mixed gradient estimator that combines pathwise and SF gradients while maintaining unbiasedness. We also show how problems with action discontinuities can be reformulated in hybrid form, further broadening its applicability. Empirically, HPO substantially outperforms PPO on inventory control and switched linear-quadratic regulator problems, with performance gaps increasing as the continuous action dimension grows. Finally, we characterize the structure of the mixed gradient, showing that its cross term -- which captures how continuous actions influence future discrete decisions -- becomes negligible near a discrete best response, thereby enabling approximate decentralized updates of the continuous and discrete components and reducing variance near optimality. All resources are available at github.com/MatiasAlvo/hybrid-rl.

Subjects: Machine Learning , Artificial Intelligence , Optimization and Control , Machine Learning

Publish: 2026-05-14 02:59:45 UTC


#10 Fast Rates for Inverse Reinforcement Learning

Authors: Andreas Schlaginhaufen, Maryam Kamgarpour

We establish novel structural and statistical results for entropy-regularized min-max inverse reinforcement learning (Min-Max-IRL) with linear reward classes in finite-horizon MDPs with Borel state and action spaces. On the structural side, we show that maximum likelihood estimation (MLE) and Min-Max-IRL are equivalent at the population level, and at the empirical level under deterministic dynamics. On the statistical side, exploiting pseudo-self-concordance of the Min-Max-IRL loss, we prove that both the trajectory-level KL divergence and the squared parameter error in the Hessian norm decay at the fast rate $\mathcal{O}(n^{-1})$, where $n$ is the number of expert trajectories. Our guarantees apply under misspecification and require no exploration assumptions. We further extend reward-identifiability results to general Borel spaces and derive novel results on the derivatives of the soft-optimal value function with respect to reward parameters.

Subjects: Machine Learning , Artificial Intelligence , Machine Learning

Publish: 2026-05-14 09:07:31 UTC


#11 A Practical Guide to Instrumental Variables Methods with Heterogeneous Treatment Effects

Authors: Tymon Słoczyński, Liyang Sun, S. Derya Uysal

Instrumental variables (IV) methods are central to applied microeconomics. While classical approaches assume linear models with constant effects, recent literature has shifted toward the local average treatment effect (LATE) framework to accommodate heterogeneous treatment effects. This paper provides a practical guide to aligning empirical practice with recent theory. We first examine how different specifications with covariates lead to distinct weighted averages of covariate-specific LATEs. We then discuss how parametric misspecification can undermine the causal interpretation of these estimands and suggest flexible specifications as essential robustness checks. Finally, we review formal tests for LATE assumptions and methods robust to monotonicity violations. We provide a guide to software implementations to help researchers apply the methods in practice.

Subjects: Econometrics , Methodology

Publish: 2026-05-14 17:29:26 UTC


#12 AIS: Adaptive Importance Sampling for Quantized RL

Authors: Jiajun Zhou, Wei Shao, Lingchao Zheng, Yuwei Fan, Ngai Wong

Reinforcement learning (RL) for large language models (LLMs) is dominated by the cost of rollout generation, which has motivated the use of low-precision rollouts (e.g., FP8) paired with a BF16 trainer to improve throughput and reduce memory pressure. This introduces a rollout-training mismatch that biases the policy gradient and can cause training to collapse outright on reasoning benchmarks. We show that the mismatch is non-stationary and acts as a double-edged sword: early in training it provides a stochastic exploration bonus, exposing the gradient to trajectories the trainer would otherwise under-sample, but the same perturbation transitions into a destabilizing source of bias as the policy concentrates. To solve this, we propose Adaptive Importance Sampling (AIS), a correction framework that adjusts the strength of its intervention on a per-batch basis. AIS combines three real-time diagnostics, namely weight reliability, divergence severity, and variance amplification, into a single mixing coefficient that interpolates between the uncorrected and fully importance-weighted gradients, suppressing the destabilizing component of the mismatch while preserving its exploratory benefit. We integrate AIS into GRPO and evaluate it on the diffusion-based LLaDA-8B-Instruct and the autoregressive Qwen3-8B and Qwen3.5-9B across mathematical reasoning and planning benchmarks. AIS matches the BF16 baseline on most tasks while retaining the 1.5 to 2.76x rollout speedup of FP8.

Subjects: Machine Learning , Artificial Intelligence , Machine Learning

Publish: 2026-05-13 03:36:57 UTC


#13 Wahkon: A Statistically Principled Deep RKHS Superposition Network

Authors: Yongkai Chen, Wenxuan Zhong, Ping Ma

Deep learning excels at prediction but often lacks finite-sample guarantees and calibrated uncertainty; RKHS (Reproducing Kernel Hilbert Space)-based methods provide those guarantees but struggle to adapt in high dimensions. We propose Wahkon, a deep RKHS superposition network that unifies Kolmogorov's superposition principle with RKHS regularization in the smoothing-spline tradition of Wahba. This yields a finite-dimensional deep representer theorem that makes training tractable and provides explicit layerwise complexity control. We show the penalized estimator is exactly the MAP (maximum a posteriori) estimate under a hierarchical Gaussian-process prior, extending the spline/GP duality to deep compositions. Using metric-entropy arguments, we establish minimax-optimal convergence rates under mild smoothness and clarify how depth and width trade off with regularity. Empirically, Wahkon outperforms multilayer perceptrons, Neural Tangent Kernels, and Kolmogorov--Arnold Networks across simulation benchmarks and a single-cell CITE-seq study. By unifying Kolmogorov's superposition principle with RKHS regularization, Wahkon delivers accuracy, interpretability, and statistical rigor in a single framework.

Subjects: Methodology , Machine Learning

Publish: 2026-05-13 19:01:59 UTC


#14 Equilibrium and Pricing in Consumer Networks with Nonlinear Utilities: An Online Shape-Constrained Learning Approach

Authors: Daniele Bracale, George Michailidis

We study optimal monopoly pricing over consumer networks governed by general nonlinear utilities. In our framework, a consumer's utility is jointly determined by an individualized price and the consumption choices of their peers, propagated through a directed and signed social graph. This formulation encapsulates a broad class of utility functions; it strictly generalizes the traditional linear-quadratic framework to include logit-type discrete choice, isoelastic, and Stone-Geary utilities under a single theoretical umbrella. We first establish the existence and uniqueness of the consumer-side equilibrium under general contraction and variational conditions, explicitly accommodating asymmetric and signed network externalities. Leveraging this equilibrium characterization, we analyze targeted price discrimination within community-structured and influencer-driven markets. To this end, we introduce a generalized measure of network influence that extends classical Katz-Bonacich centrality beyond the Euclidean domain. Finally, addressing the challenge of unknown consumer utility functions, we develop a shape-constrained, tuning-parameter-free learning approach utilizing isotonic regression, for which we establish strict no-regret convergence guarantees. Supported by extensive simulations, our results seamlessly integrate equilibrium analysis and nonparametric learning into a cohesive monopoly pricing framework.

Subject: Statistics Theory

Publish: 2026-05-13 23:19:05 UTC


#15 On the Burden of Achieving Fairness in Conformal Prediction

Authors: Ziang Gao, Pengqi Liu, Archer Yi Yang, Mouloud Belbahri, Jesse C. Cresswell, Masoud Asgharian

Conformal prediction is often calibrated with a single pooled threshold, but this can hide cross-group heterogeneity in score distributions and distort group-wise coverage. We study this phenomenon through the population score distributions underlying split conformal calibration. First, we derive a conservation law and lower bound showing that pooled calibration incurs irreducible group-wise coverage distortion at a scale set by cross-group quantile heterogeneity. Second, we demonstrate that the two leading fairness definitions for conformal prediction, Equalized Coverage and Equalized Set Size, are fundamentally in tension. Third, we quantify the cost of moving between policies which treat groups separately or pool them. Experiments on synthetic and real data confirm the same bidirectional trade-off after finite-sample calibration. Our results show that, for the policy families studied here, calibration choice does not remove cross-group heterogeneity; it determines whether the resulting distortion appears in the coverage or size dimension, providing a principled lens for analyzing fairness-oriented calibration choices in practice.
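The effect of a single pooled threshold can be illustrated with a small sketch of split conformal calibration. The two synthetic score groups are hypothetical, and for brevity coverage is measured on the calibration scores themselves rather than on held-out data.

```python
import math


def conformal_threshold(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the set must be trivial.
        return math.inf
    return sorted(scores)[k - 1]


def coverage(scores, threshold):
    """Fraction of scores that fall at or below the threshold."""
    return sum(s <= threshold for s in scores) / len(scores)


# Hypothetical nonconformity scores: group B's scores are twice as spread out.
group_a = [i / 100 for i in range(100)]
group_b = [2 * i / 100 for i in range(100)]
alpha = 0.1

pooled = conformal_threshold(group_a + group_b, alpha)
print("pooled threshold:", pooled)
print("  coverage A:", coverage(group_a, pooled))
print("  coverage B:", coverage(group_b, pooled))

for name, group in (("A", group_a), ("B", group_b)):
    own = conformal_threshold(group, alpha)
    print(f"group {name} threshold:", own, "coverage:", coverage(group, own))
```

With the pooled threshold, group A is over-covered and group B falls below the nominal 90% level. Group-wise thresholds restore coverage in both groups, but group B's threshold is twice group A's, which in a real predictor would show up as larger prediction sets. This mirrors, in a toy setting, the coverage-versus-size tension the abstract describes.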

Subjects: Machine Learning , Machine Learning

Publish: 2026-05-14 02:02:06 UTC


#16 Adaptive Long-Run Variance Thresholding for Sparse Covariance Estimation in High-Dimensional Time Series

Authors: Wenhao Zhang, Zhaoxing Gao

Estimating a sparse covariance matrix is a fundamental problem in high-dimensional statistics. However, thresholding methods developed for independent data are generally not directly applicable to high-dimensional time series, where temporal dependence alters the stochastic behavior of sample covariance estimators. This paper studies sparse covariance matrix estimation for high-dimensional time series under weak dependence. We propose a thresholding procedure that incorporates long-run variance into the construction of entry-specific thresholds, thereby adapting to temporal dependence. Under suitable regularity conditions, we show that the proposed estimator is consistent under the spectral norm and attains the optimal convergence rate over a class of sparse covariance matrices. We further establish support recovery consistency for identifying the nonzero entries of the covariance matrix. In addition, we show that universal and adaptive thresholding methods developed for independent data may fail to recover the support consistently in the presence of autocorrelation. Simulation studies demonstrate that the proposed method compares favorably with existing thresholding estimators in terms of both estimation accuracy and support recovery. Applications to gene expression data and stock return data further illustrate its practical usefulness.

Subjects: Methodology , Statistics Theory

Publish: 2026-05-14 07:30:38 UTC


#17 Large Dimensional Kernel Ridge Regression: Extending to Product Kernels

Authors: Yang Zhou, Yicheng Li, Yuqian Cheng, Qian Lin

Recent studies have reported $\textit{saturation effects}$ and $\textit{multiple descent behavior}$ in large dimensional kernel ridge regression (KRR). However, these findings are predominantly derived under restrictive settings, such as inner product kernels on the sphere or strong eigenfunction assumptions like hypercontractivity. Whether such behaviors hold for other kernels remains an open question. In this paper, we establish a broad, new family of large dimensional kernels and derive the corresponding convergence rates of the generalization error. As a result, we recover key phenomena previously associated with inner product kernels on the sphere, including: $i)$ the $\textit{minimax optimality}$ when the source condition satisfies $s\le 1$; $ii)$ the $\textit{saturation effect}$ when $s>1$; $iii)$ a $\textit{periodic plateau phenomenon}$ in the convergence rate and a $\textit{multiple-descent behavior}$ with respect to the sample size $n$.

Subjects: Machine Learning , Machine Learning

Publish: 2026-05-14 08:08:09 UTC


#18 Piece-wise linear isotonic regression

Authors: Timo Kuosmanen, Juan F. Monge, José L. Ruiz, Xun Zhou

Isotonic regression provides a flexible, tuning-free approach to estimating monotonic functions without imposing global curvature constraints, yet the estimated regression function is inherently a step function. This paper addresses a key limitation of such estimators: their inability to provide meaningful marginal properties, such as shadow prices or elasticities. We propose a novel piece-wise linear smoothing framework that recovers meaningful marginal estimates even in non-convex settings. Building on the concept of conditional convexity originally developed in deterministic frontier analysis, we formulate the smoothing process as a bilevel optimization problem that fits a continuous, monotonic, piece-wise linear function to the initial isotonic regression predictions. Monte Carlo simulations demonstrate that the proposed approach can significantly improve estimation precision, reducing mean squared error in both convex and non-convex settings for univariate and multivariate data. We apply this approach to analyze agglomeration economies in Finnish municipalities, illustrating its practical value.
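The step-function behavior mentioned in the abstract is easy to see in a minimal sketch of the classical pool-adjacent-violators algorithm for univariate isotonic regression. This is the standard first-stage fit only, not the paper's piece-wise linear smoothing, and the function name `isotonic_fit` is a hypothetical choice.

```python
def isotonic_fit(y):
    """Least-squares non-decreasing fit to y via pool-adjacent-violators.

    y is assumed to be ordered by the covariate. Returns fitted values
    of the same length.
    """
    blocks = []  # each block is [sum of values, number of values]
    for value in y:
        blocks.append([value, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        # Every point in a pooled block receives the block mean.
        fitted.extend([total / count] * count)
    return fitted


print(isotonic_fit([1, 3, 2, 4]))  # → [1.0, 2.5, 2.5, 4.0]
```

The violating pair (3, 2) is pooled into a flat segment at 2.5. Within such a segment the fitted function has zero slope, which is why marginal quantities such as elasticities cannot be read off the raw isotonic fit.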

Subject: Methodology

Publish: 2026-05-14 15:16:44 UTC


#19 RoSHAP: A Distributional Framework and Robust Metric for Stable Feature Attribution

Authors: Lanxin Xiang, Liang Shi, Youhui Ye, Boyu Jiang, Dawei Zhou, Feng Guo

Feature attribution analysis is critical for interpreting machine learning models and supporting reliable data-driven decisions. However, feature attribution measures often exhibit stochastic variation: different train--test splits, random seeds, or model-fitting procedures can produce substantially different attribution values and feature rankings. This paper proposes a framework for incorporating the stochastic nature of feature attribution and a robust attribution metric, RoSHAP, for stable feature ranking based on the SHAP metric. The proposed framework models the distribution of feature attribution scores and estimates it through bootstrap resampling and kernel density estimation. We show that, under mild regularity conditions, the aggregated feature attribution score is asymptotically Gaussian, which greatly reduces the computational cost of distribution estimation. RoSHAP summarizes the distribution of SHAP values into a robust feature-ranking criterion that simultaneously rewards features that are active, strong, and stable. Through simulations and real-data experiments, the proposed framework and RoSHAP outperform standard single-run attribution measures in identifying signal features. In addition, models built using RoSHAP-selected features achieve predictive performance comparable to full-feature models while using substantially fewer predictors. The proposed RoSHAP approach improves the stability and interpretability of machine learning models, enabling reliable and consistent insights for analysis.

Subjects: Machine Learning , Machine Learning

Publish: 2026-05-14 17:51:09 UTC


#20 Three ways to find comfort with the Bell proof and the results of the Bell experiments

Authors: Richard D Gill, Inge S. Helland, Bart Jongejan

Bell's theorem states that no description of a Bell experiment can be simultaneously local, realistic in the sense of counterfactual definiteness, and free of conspiracy between settings and hidden state. The recent generation of experiments has confirmed the predicted violation of the CHSH inequality, so one of the assumptions must be abandoned. Which one, and how one reconstructs a coherent worldview after doing so, is a question on which many authors disagree. This paper is written by three such authors. All three reject both counterfactual definiteness and conspiratorial violation of statistical independence of setting choices and state. After a joint exposition of the classical half of Bell's theorem in the language of Pearl-style causal graphs, a joint summary of the loophole-free experiments, and a joint survey of the recent literature, each author states where they have presently arrived. Gill accepts irreducible and non-local quantum randomness and finds the choice between locality and realism a false dichotomy. In his later works, Bell derives counterfactual definiteness from classical local causality, and that is what has to go. The metaphysical concepts "realism", "locality", "causality" need to be reconsidered. Helland reconstructs the Hilbert-space formalism from a theory of accessible variables, and from this theory he concludes that every observer must be limited in a specific sense. Jongejan proposes a geometric hidden-variable construction in which the degree of violation of the CHSH inequality depends on the number of dimensions of space, Tsirelson's bound corresponding to three dimensions. The authors conclude with a discussion.

Subject: Quantum Physics

Publish: 2026-05-13 08:19:46 UTC


#21 XAI and Statistical Analysis for Reliable Intrusion Detection in the UAVIDS-2025 Dataset: From Tree to Hybrid and Tabular DNN Ensembles

Authors: Iakovos-Christos Zarkadis, Christos Douligeris

In recent years, Mechanistic Interpretability, a specific area under the umbrella of explainable artificial intelligence (XAI), has been introduced to explain the decisions made by complex machine learning (ML) models in critical systems such as UAV intrusion detection systems (UAVIDS). In this paper, we apply best practices for data pre-processing and examine a wide range of tree ensembles, deep neural networks, hybrid stacking models, and the latest ensemble neural networks to detect intrusions in UAVs, using stratified 10-fold cross-validation. With our top-performing model, XGBoost, we apply SHapley Additive exPlanations (SHAP) to analyze the global and local feature importances, to understand which features each attack targets in order to mimic normal traffic, and to locate where the misclassifications occur. A distribution analysis follows, visually comparing violin plots and kernel density estimate (KDE) curves. Using the Westfall-Young permutation test for multiple comparisons, bandwidth optimization of the KDEs, and the Jensen-Shannon distance selected for the test, we discover the true causes of the false predictions observed for Wormhole and Blackhole attacks in UAVIDS-2025. The findings provide robust, reliable, and explainable models for UAV intrusion detection, along with statistical insights that capture and clarify the masked nature of these attacks with respect to the challenge of Density Support Intersection between them in this dataset.
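
A minimal sketch of the Jensen-Shannon distance named above, assuming the two feature distributions have already been discretized onto a common grid (for example, by evaluating two KDEs at the same points). The discretization and the function name are assumptions; the paper's permutation-testing procedure is not reproduced here.

```python
import math

def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (log base 2).

    p and q are non-negative weights on the same grid; they are normalized
    internally. The result lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution

    def kl(u, v):
        # Kullback-Leibler divergence; terms with u_i = 0 contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding
```

Identical distributions give a distance of 0, and distributions with disjoint support give 1, so heavily overlapping attack classes show up as values near 0.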

Subjects: Cryptography and Security, Machine Learning, Computation

Publish: 2026-05-13 14:08:36 UTC


#22 Unsupervised learning of acquisition variability in structural connectomes via hybrid latent space modeling

Authors: Gaurav Rudravaram, Lianrui Zuo, Karthik Ramadass, Elyssa McMaster, Jongyeon Yoon, Aravind R. Krishnan, Adam M. Saunders, Chenyu Gao, Nancy R. Newlin, Praitayini Kanakaraj, Lori L. Beason Held, Murat Bilgel, Laura A. Barquero, Micah DArchangel, Tin Q. Nguyen, Laurie B. Cutting, Derek Archer, Timothy J. Hohman, Daniel C. Moyer, Bennett A. Landman

Acquisition differences across sites, scanners, and protocols in dMRI introduce variability that complicates structural connectome analysis. This motivates deep learning models that can represent high-dimensional connectomes in a low-dimensional space while explicitly separating acquisition-related effects from biological variation. Conventional dimensionality reduction methods model all variance as continuous, so acquisition effects often get absorbed into a continuous latent space. Recent hybrid latent-space models combine discrete and continuous components to address this, but typically require manual capacity tuning to ensure the discrete component captures the intended variability. We introduce an unsupervised framework that removes this manual tuning by architecturally annealing encoder outputs before decoding, allowing the model to adaptively balance discrete and continuous latent variables during training. To evaluate it, we curated a dataset of N=7,416 structural connectomes derived from dMRI, spanning ages 2 to 102 and 13 studies with 25 unique acquisition-parameter combinations. Of these, 5,900 are cognitively unimpaired, 877 have mild cognitive impairment (MCI), and 639 have Alzheimer's disease (AD). We compare against a standard VAE, PCA with k-means clustering, and hybrid models that anneal only through the loss function. Our architectural annealing produces stronger site learning (ARI=0.53, p<0.05) than these baselines. Results show that a hybrid continuous-discrete latent space, with architectural rather than loss-based annealing, provides a useful unsupervised mechanism for capturing acquisition variability in dMRI: by jointly modeling smooth and categorical structure, the Joint-VAE recovers clusters aligned with scanner and protocol differences.
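
The site-learning score above is reported as an adjusted Rand index (ARI). The sketch below computes the standard ARI between two labelings, for instance known acquisition sites and the discrete latent assignments; how the paper obtains its cluster labels is not shown, and the function name is illustrative.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("labelings must have the same length")
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Pair counts within contingency cells, rows (true), and columns (pred).
    cell_pairs = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    row_pairs = sum(comb(c, 2) for c in Counter(labels_true).values())
    col_pairs = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = row_pairs * col_pairs / comb(n, 2)  # chance-level agreement
    maximum = 0.5 * (row_pairs + col_pairs)
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivially identical
    return (cell_pairs - expected) / (maximum - expected)
```

An ARI of 1 means the clusters match the sites exactly up to relabeling, and values near 0 mean agreement no better than chance.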

Subjects: Machine Learning, Artificial Intelligence, Machine Learning

Publish: 2026-05-13 16:11:49 UTC


#23 Winning Lottery Tickets in Neural Networks via a Quantum-Inspired Classical Algorithm

Authors: Natsuto Isogai, Hayata Yamasaki, Sho Sonoda, Mio Murao

Quantum machine learning (QML) aims to accelerate machine learning tasks by exploiting quantum computation. Previous work studied a QML algorithm for selecting sparse subnetworks from large shallow neural networks. Instead of directly solving an optimization problem over a large-scale network, this algorithm constructs a sparse subnetwork by sampling hidden nodes from an optimized probability distribution defined using the ridgelet transform. The quantum algorithm performs this sampling in time $O(D)$ in the data dimension $D$, whereas a naive classical implementation relies on handling exponentially many candidate nodes and hence takes $\exp[O(D)]$ time. In this work, we construct and analyze a quantum-inspired fully classical algorithm for the same sampling task. We show that our algorithm runs in time $O(\operatorname{poly}(D))$, thereby removing the exponential dependence on $D$ from the previous classical approach. Numerical simulations show that the proposed sampler achieves empirical risk comparable to exact sampling from the optimized distribution and substantially lower than sampling from the non-optimized uniform distribution, while also exhibiting exponentially improved runtime scaling compared with the conventional classical implementation. These successful dequantization results show that sparse subnetwork selection via optimized sampling can be achieved classically with polynomial data-dimension scaling on conventional computers without quantum hardware, providing an alternative to the existing quantum algorithm.

Subjects: Quantum Physics, Machine Learning, Machine Learning

Publish: 2026-05-13 18:00:56 UTC


#24 Regret Equals Covariance: A Closed-Form Characterization for Stochastic Optimization

Author: Irene Aldridge

Regret is the cost of uncertainty in algorithmic decision-making. Quantifying regret typically requires computationally expensive simulation via Sample Average Approximation (SAA), with complexity $\mathcal{O}(Bn^{2}d^{3})$ in the number of scenarios $B$, variables $n$, and constraints $d$. This paper proves that expected regret in any stochastic optimization problem admits the exact decomposition $\mathrm{Regret}(c) = \mathrm{Cov}(c,\,π^{*}(c)) + R(c)$, where $c$ is the vector of uncertain parameters, $π^{*}(c)$ is the optimal decision, and $R(c)$ is a residual whose magnitude we bound explicitly under Lipschitz, smooth, and strongly convex conditions. For linear programs and unconstrained quadratic programs, including the classical Markowitz portfolio problem, we prove $R(c)=0$ exactly, so that $\mathrm{Regret}(c) = \mathrm{Cov}(c,π^{*}(c))$ holds without approximation. When historical cost-decision pairs $\{(c_i, π^*(c_i))\}$ are available, the covariance can be estimated in $\mathcal{O}(nd^{2})$ time, which is orders of magnitude faster than SAA. The estimation is performed by a single pass through the data. We derive concentration bounds, a central limit theorem, and an asymptotically unbiased residual estimator, and we validate all results on synthetic LP, QP, and integer programming instances and on a rolling-window portfolio experiment using ten years of CRSP equity data.
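
A minimal sketch of a single-pass covariance estimate from cost-decision pairs, using a Welford-style running update. Reading $\mathrm{Cov}(c, π^{*}(c))$ as the sum of coordinate-wise covariances between $c$ and $π^{*}(c)$ is an assumption made here for illustration, as is the function name; this is not the paper's estimator and makes no claim about its stated complexity.

```python
def streaming_cross_covariance(pairs):
    """One-pass estimate of sum_j Cov(c_j, pi_j) from (c, pi) vector pairs.

    Each pair holds a cost vector c and a decision vector pi of equal length.
    Uses Welford-style running means, so the data is never stored or revisited.
    """
    n = 0
    mean_c = mean_pi = None
    co_moment = 0.0
    for c, pi in pairs:
        if len(c) != len(pi):
            raise ValueError("c and pi must have the same length")
        if mean_c is None:
            mean_c = [0.0] * len(c)
            mean_pi = [0.0] * len(pi)
        n += 1
        for j in range(len(c)):
            delta_c = c[j] - mean_c[j]                 # deviation from the old mean of c_j
            mean_c[j] += delta_c / n
            mean_pi[j] += (pi[j] - mean_pi[j]) / n
            co_moment += delta_c * (pi[j] - mean_pi[j])  # uses the updated mean of pi_j
    if n < 2:
        raise ValueError("need at least two pairs")
    return co_moment / (n - 1)  # unbiased sample covariance
```

Because `pairs` can be any iterable, the same function works on a generator that streams historical records from disk.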

Subjects: Econometrics, Machine Learning, Statistics Theory, Computation

Publish: 2026-05-13 18:32:44 UTC


#25 Finite-size scaling of hetero-associative retrieval in continuous-signal-driven Ising spin systems

Author: Andrea Ladiana

Real-world physical signals are continuous and high-dimensional, yet the statistical-mechanics machinery of associative memory operates on discrete Ising spins. We bridge this divide through a multilayer Ising framework that couples a geometry-preserving continuous-to-Ising encoder (PCA whitening composed with SimHash random-hyperplane projection) to Kanter-Sompolinsky pseudo-inverse memory couplings, embedded directly into the local-field equations of a tri-layer hetero-associative system. The pseudo-inverse correction renders the equal-weight mixture state thermodynamically unstable, so that thermal fluctuations break the cross-modal symmetry and select a single global winner. We further establish a dynamical duality: parallel (Little) updates are structurally required to ignite the cross-modal signal avalanche from a single cued layer, whereas sequential (Glauber) sweeps resolve symmetric superpositions. The operational storage capacity obeys the Amit-Gutfreund-Sompolinsky finite-size correction $α_c(N)=α_c(\infty)-c\,N^{-1/2}$, extrapolating to an asymptotic operational limit $α_c(\infty)\approx 0.50$ under macroscopic-basin retrieval. Applied to multi-channel sleep polysomnography (PhysioNet Sleep-EDF), the architecture reconstructs the macroscopic sleep state on parietal EEG and EOG axes from a single noisy frontal-EEG cue, demonstrating cross-modal recall in the presence of quenched biological disorder.
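
The finite-size correction $α_c(N)=α_c(\infty)-c\,N^{-1/2}$ is linear in $x = N^{-1/2}$, so the asymptotic capacity can be read off as the intercept of an ordinary least-squares line. The sketch below assumes that fitting approach, which the abstract does not specify; the function name is illustrative and the data are synthetic points generated from the formula itself with made-up parameters, not results from the paper.

```python
def fit_finite_size_scaling(sizes, capacities):
    """Least-squares fit of alpha_c(N) = alpha_inf - c * N**-0.5.

    Returns (alpha_inf, c).
    """
    xs = [n ** -0.5 for n in sizes]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(capacities) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, capacities))
    slope = sxy / sxx
    alpha_inf = mean_y - slope * mean_x  # intercept at x = 0, i.e. N -> infinity
    return alpha_inf, -slope             # c is the negated slope

# Synthetic check: points generated exactly from the scaling law with
# hypothetical parameters alpha_inf = 0.5 and c = 1.0.
sizes = [100, 400, 1600, 6400]
capacities = [0.5 - 1.0 * n ** -0.5 for n in sizes]
alpha_inf, c = fit_finite_size_scaling(sizes, capacities)
```

On noise-free synthetic input the fit recovers the generating parameters; with measured capacities the intercept gives the extrapolated $α_c(\infty)$.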

Subjects: Disordered Systems and Neural Networks, Machine Learning

Publish: 2026-05-13 19:33:51 UTC