Statistics

2025-12-08 | Total: 39

#1 Consequences of Kernel Regularity for Bandit Optimization

Authors: Madison Lee, Tara Javidi

In this work we investigate the relationship between kernel regularity and algorithmic performance in the bandit optimization of RKHS functions. While reproducing kernel Hilbert space (RKHS) methods traditionally rely on global kernel regressors, it is also common to use a smoothness-based approach that exploits local approximations. We show that these perspectives are deeply connected through the spectral properties of isotropic kernels. In particular, we characterize the Fourier spectra of the Matérn, square-exponential, rational-quadratic, $γ$-exponential, piecewise-polynomial, and Dirichlet kernels, and show that the decay rate determines asymptotic regret from both viewpoints. For kernelized bandit algorithms, spectral decay yields upper bounds on the maximum information gain, governing worst-case regret, while for smoothness-based methods, the same decay rates establish Hölder space embeddings and Besov space norm-equivalences, enabling local continuity analysis. These connections show that kernel-based and locally adaptive algorithms can be analyzed within a unified framework. This allows us to derive explicit regret bounds for each kernel family, obtaining novel results in several cases and providing improved analysis for others. Furthermore, we analyze LP-GP-UCB, an algorithm that combines both approaches, augmenting global Gaussian process surrogates with local polynomial estimators. While the hybrid approach does not uniformly dominate specialized methods, it achieves order-optimality across multiple kernel families.

Subject: Machine Learning

Publish: 2025-12-05 18:54:09 UTC


#2 Designing an Optimal Sensor Network via Minimizing Information Loss

Authors: Daniel Waxman, Fernando Llorente, Katia Lamer, Petar M. Djurić

Optimal experimental design is a classic topic in statistics, with many well-studied problems, applications, and solutions. The design problem we study is the placement of sensors to monitor spatiotemporal processes, explicitly accounting for the temporal dimension in our modeling and optimization. We observe that recent advancements in computational sciences often yield large datasets based on physics-based simulations, which are rarely leveraged in experimental design. We introduce a novel model-based sensor placement criterion, along with a highly efficient optimization algorithm, which integrates physics-based simulations and Bayesian experimental design principles to identify sensor networks that "minimize information loss" from simulated data. Our technique relies on sparse variational inference and (separable) Gauss-Markov priors, and thus may adapt many techniques from Bayesian experimental design. We validate our method through a case study monitoring air temperature in Phoenix, Arizona, using state-of-the-art physics-based simulations. Our results show our framework to be superior to random or quasi-random sampling, particularly with a limited number of sensors. We conclude by discussing practical considerations and implications of our framework, including more complex modeling tools and real-world deployments.

Subjects: Methodology , Machine Learning , Computation

Publish: 2025-12-05 18:38:30 UTC


#3 BalLOT: Balanced $k$-means clustering with optimal transport

Authors: Wenyan Luo, Dustin G. Mixon

We consider the fundamental problem of balanced $k$-means clustering. In particular, we introduce an optimal transport approach to alternating minimization called BalLOT, and we show that it delivers a fast and effective solution to this problem. We establish this with a variety of numerical experiments before proving several theoretical guarantees. First, we prove that for generic data, BalLOT produces integral couplings at each step. Next, we perform a landscape analysis to provide theoretical guarantees for both exact and partial recoveries of planted clusters under the stochastic ball model. Finally, we propose initialization schemes that achieve one-step recovery of planted clusters.

Subjects: Machine Learning , Data Structures and Algorithms , Information Theory , Optimization and Control

Publish: 2025-12-05 18:04:35 UTC


#4 A Note on the Finite Sample Bias in Time Series Cross-Validation

Author: Amaze Lusompa

It is well known that model selection via cross validation can be biased for time series models. However, many researchers have argued that this bias does not apply when using cross-validation with vector autoregressions (VAR) or with time series models whose errors follow a martingale-like structure. I show that even under these circumstances, performing cross-validation on time series data will still generate bias in general.

Subjects: Methodology , Econometrics , Statistics Theory

Publish: 2025-12-05 17:23:54 UTC


#5 The Bayesian Way: Uncertainty, Learning, and Statistical Reasoning

Authors: Juan Sosa, Carlos A. Martínez, Danna Cruz

This paper offers a comprehensive introduction to Bayesian inference, combining historical context, theoretical foundations, and core analytical examples. Beginning with Bayes' theorem and the philosophical distinctions between Bayesian and frequentist approaches, we develop the inferential framework for estimation, interval construction, hypothesis testing, and prediction. Through canonical models, we illustrate how prior information and observed data are formally integrated to yield posterior distributions. We also explore key concepts including loss functions, credible intervals, Bayes factors, identifiability, and asymptotic behavior. While emphasizing analytical tractability in classical settings, we outline modern extensions that rely on simulation-based methods and discuss challenges related to prior specification and model evaluation. Though focused on foundational ideas, this paper sets the stage for applying Bayesian methods in contemporary domains such as hierarchical modeling, nonparametrics, and structured applications in time series, spatial data, networks, and political science. The goal is to provide a rigorous yet accessible entry point for students and researchers seeking to adopt a Bayesian perspective in statistical practice.

Subjects: Methodology , Statistics Theory

Publish: 2025-12-05 16:59:25 UTC


#6 Model selection with uncertainty in estimating optimal dynamic treatment regimes

Authors: Chunyu Wang, Brian Tom

Optimal dynamic treatment regimes (DTRs), a key part of precision medicine, have gained increasing attention in recent years. To inform clinical decision making, interpretable and parsimonious models for contrast functions are preferred, raising concerns about undue misspecification. It is therefore important to properly evaluate the performance of candidate interpretable models and select the one that best approximates the unknown contrast function. Moreover, since a DTR usually involves multiple decision points, an inaccurate approximation at a later decision point affects its estimation at an earlier decision point when a backward induction algorithm is applied. This paper aims to perform model selection for contrast functions in the context of learning optimal DTRs from observed data. Note that the relative performance of candidate models may heavily depend on the sample size when, for example, the comparison is made between parametric and tree-based models. Therefore, instead of investigating the limiting behavior of each candidate model and developing methods to select asymptotically the 'correct' one, we focus on the finite sample performance of each model and attempt to perform model selection under a given sample size. To this end, we adopt the counterfactual cross-validation metric and propose a novel method to estimate the variance of the metric. Supplementing the cross-validation metric with its estimated variance allows us to characterize the uncertainty in model selection under a given sample size and facilitates hypothesis testing associated with a preferred model structure.

Subject: Methodology

Publish: 2025-12-05 13:22:52 UTC


#7 Empirical Decision Theory

Authors: Christoph Jansen, Georg Schollmeyer, Thomas Augustin, Julian Rodemann

Analyzing decision problems under uncertainty commonly relies on idealizing assumptions about the describability of the world, with the most prominent examples being the closed world and the small world assumption. Most assumptions are operationalized by introducing states of the world, conditional on which the decision situation can be analyzed without any remaining uncertainty. Conversely, most classical decision-theoretic approaches are not applicable if the states of the world are inaccessible. We propose a decision model that retains the appeal and simplicity of the original theory, but completely overcomes the need to specify the states of the world explicitly. The main idea of our approach is to address decision problems in a radically empirical way: instead of specifying states and consequences prior to the decision analysis, we only assume a protocol of observed act-consequence pairs as model primitives. We show how optimality in such empirical decision problems can be addressed by using protocol-based empirical choice functions and discuss three approaches for deriving inferential guarantees: (I) consistent statistical estimation of choice sets, (II) consistent statistical testing of choice functions with robustness guarantees, and (III) direct inference for empirical choice functions using credal sets. We illustrate our theory with a proof-of-concept application comparing different prompting strategies in generative AI models.

Subjects: Methodology , Probability , Machine Learning

Publish: 2025-12-05 12:46:04 UTC


#8 Generalised Bayesian Inference using Robust divergences for von Mises-Fisher distribution

Authors: Tomoyuki Nakagawa, Yasuhito Tsuruta, Sho Kazari, Kouji Tahata

This paper focusses on robust estimation of location and concentration parameters of the von Mises-Fisher distribution in the Bayesian framework. The von Mises-Fisher (or Langevin) distribution has played a central role in directional statistics. Directional data have been investigated for many decades, and more recently, they have gained increasing attention in diverse areas such as bioinformatics and text data analysis. Although outliers can significantly affect the estimation results even for directional data, the treatment of outliers remains an unresolved and challenging problem. In the frequentist framework, numerous studies have developed robust estimation methods for directional data with outliers, but, in contrast, only a few robust estimation methods have been proposed in the Bayesian framework. In this paper, we propose Bayesian inference based on density power-divergence and $γ$-divergence and establish their asymptotic properties and robustness. In addition, the Bayesian approach naturally provides a way to assess estimation uncertainty through the posterior distribution, which is particularly useful for small samples. Furthermore, to carry out the posterior computation, we develop the posterior computation algorithm based on the weighted Bayesian bootstrap for estimating parameters. The effectiveness of the proposed methods is demonstrated through simulation studies. Using two real datasets, we further show that the proposed method provides reliable and robust estimation even in the presence of outliers or data contamination.

Subjects: Methodology , Statistics Theory , Computation

Publish: 2025-12-05 12:25:59 UTC


#9 Efficient sequential Bayesian inference for state-space epidemic models using ensemble data assimilation

Authors: Dhorasso Temfack, Jason Wyse

Estimating latent epidemic states and model parameters from partially observed, noisy data remains a major challenge in infectious disease modeling. State-space formulations provide a coherent probabilistic framework for such inference, yet fully Bayesian estimation is often computationally prohibitive because evaluating the observed-data likelihood requires integration over all latent trajectories. The Sequential Monte Carlo squared (SMC$^2$) algorithm offers a principled approach for joint state and parameter inference, combining an outer SMC sampler over parameters with an inner particle filter that estimates the likelihood up to the current time point. Despite its theoretical appeal, this nested particle filter imposes substantial computational cost, limiting routine use in near-real-time outbreak response. We propose Ensemble SMC$^2$ (eSMC$^2$), a scalable variant that replaces the inner particle filter with an Ensemble Kalman Filter (EnKF) to approximate the incremental likelihood at each observation time. While this substitution introduces bias via a Gaussian approximation, we mitigate finite-sample effects using an unbiased Gaussian density estimator and adapt the EnKF for epidemic data through state-dependent observation variance. This makes our approach particularly suitable for overdispersed incidence data commonly encountered in infectious disease surveillance. Simulation experiments with known ground truth and an application to 2022 United States (U.S.) monkeypox incidence data demonstrate that eSMC$^2$ achieves substantial computational gains while producing posterior estimates comparable to SMC$^2$. The method accurately recovers latent epidemic trajectories and key epidemiological parameters, providing an efficient framework for sequential Bayesian inference from imperfect surveillance data.

Subjects: Methodology , Computation

Publish: 2025-12-05 11:51:55 UTC


#10 A survival analysis of glioma patients using topological features and locations of tumors

Authors: Yuhyeong Jang, Tu Dan, Eric Vu, Chul Moon

Tumor shape plays a critical role in influencing both growth and metastasis. We introduce a novel topological radiomic feature derived from persistent homology to characterize tumor shape, focusing on its association with time-to-event outcomes in gliomas. These features effectively capture diverse tumor shape patterns that are not represented by conventional radiomic measures. To incorporate these features into survival analysis, we employ a functional Cox regression model in which the topological features are represented in a functional space. We further include interaction terms between shape features and tumor location to capture lobe-specific effects. This approach enables interpretable assessment of how tumor morphology relates to survival risk. We evaluate the proposed method in two case studies using radiomic images of high-grade and low-grade gliomas. The findings suggest that the topological features serve as strong predictors of survival prognosis, remaining significant after adjusting for clinical variables, and provide additional clinically meaningful insights into tumor behavior.

Subject: Methodology

Publish: 2025-12-05 11:46:28 UTC


#11 Design-marginal calibration of Gaussian process predictive distributions: Bayesian and conformal approaches

Authors: Aurélien Pion, Emmanuel Vazquez

We study the calibration of Gaussian process (GP) predictive distributions in the interpolation setting from a design-marginal perspective. Conditioning on the data and averaging over a design measure μ, we formalize μ-coverage for central intervals and μ-probabilistic calibration through randomized probability integral transforms. We introduce two methods. cps-gp adapts conformal predictive systems to GP interpolation using standardized leave-one-out residuals, yielding stepwise predictive distributions with finite-sample marginal calibration. bcr-gp retains the GP posterior mean and replaces the Gaussian residual by a generalized normal model fitted to cross-validated standardized residuals. A Bayesian selection rule, based either on a posterior upper quantile of the variance for conservative prediction or on a cross-posterior Kolmogorov-Smirnov criterion for probabilistic calibration, controls dispersion and tail behavior while producing smooth predictive distributions suitable for sequential design. Numerical experiments on benchmark functions compare cps-gp, bcr-gp, Jackknife+ for GPs, and the full conformal Gaussian process, using calibration metrics (coverage, Kolmogorov-Smirnov, integral absolute error) and accuracy or sharpness through the scaled continuous ranked probability score.

Subjects: Machine Learning , Methodology

Publish: 2025-12-05 10:53:20 UTC


#12 Multi-state Modeling of Delay Evolution in Suburban Rail Transports

Authors: Stefania Colombo, Alfredo Gimenez Zapiola, Francesca Ieva, Simone Vantini

Train delays are a persistent issue in railway systems, particularly in suburban networks where operational complexity is heightened by frequent services and high passenger volumes. Traditional delay models often overlook the temporal and structural dynamics of real delay propagation. This work applies continuous-time multi-state models to analyze the temporal evolution of delay on the S5 suburban line in Lombardy, Italy. Using detailed operational, meteorological, and contextual data, the study models delay transitions while accounting for observable heterogeneity. The findings reveal how delay dynamics vary by travel direction, time slot, and route segment. Covariates such as station saturation and passenger load are shown to significantly affect the risk of delay escalation or recovery. The study offers both methodological advancements and practical results for improving the reliability of rail services.

Subject: Applications

Publish: 2025-12-05 08:30:29 UTC


#13 Consistency of Familial DNA Search Results in Southeast Asian Populations

Authors: Monchai Kooakachai, Tiwakorn Chapalee, Chairat Thitiyan, Patsaya Jumnongwut

DNA databases are widely used in forensic science to identify unknown offenders. When no exact match is found, familial DNA searches can help by identifying first-degree relatives using likelihood ratios. If multiple subpopulations are relevant, likelihood ratios can be computed separately based on allele frequency estimates. Various strategies exist to combine these ratios, such as averaging allele frequencies or taking the average, maximum, or minimum likelihood ratio. While some comparisons have been made in populations like those in the U.S., their effectiveness in other regions remains unclear. This study evaluates likelihood ratio-based strategies in Southeast Asian populations, specifically Thailand, Malaysia, and Singapore. Our findings align with previous research, showing that statistical power varies across strategies. Among Thai subpopulations, the minimum likelihood ratio strategy is preferred, as it maintains high power while minimizing differences between subpopulations.
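The three ratio-combining strategies named in the abstract (average, maximum, minimum) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `combine_likelihood_ratios`, the subpopulation labels, and the numeric values are all hypothetical, and the per-subpopulation likelihood ratios are assumed to have been computed already. The fourth strategy, averaging allele frequencies before computing a single ratio, is not shown because it requires genotype-level calculations that the abstract does not describe.

```python
from statistics import fmean


def combine_likelihood_ratios(lrs, strategy):
    """Combine per-subpopulation likelihood ratios into one score.

    lrs: dict mapping a subpopulation label to its likelihood ratio
         (assumed to be computed elsewhere from allele frequencies).
    strategy: "average", "maximum", or "minimum".
    """
    values = list(lrs.values())
    if not values:
        raise ValueError("at least one likelihood ratio is required")
    if strategy == "average":
        return fmean(values)
    if strategy == "maximum":
        return max(values)
    if strategy == "minimum":
        return min(values)
    raise ValueError(f"unknown strategy: {strategy!r}")


# Hypothetical likelihood ratios for one candidate pair, one per subpopulation.
example_lrs = {"subpop_A": 120.0, "subpop_B": 40.0, "subpop_C": 80.0}

for name in ("average", "maximum", "minimum"):
    print(name, combine_likelihood_ratios(example_lrs, name))
```

The minimum strategy is the most conservative of the three: a candidate is flagged as a likely relative only if every subpopulation's ratio exceeds the decision threshold.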

Subject: Applications

Publish: 2025-12-05 07:30:39 UTC


#14 Do We Really Even Need Data? A Modern Look at Drawing Inference with Predicted Data

Authors: Stephen Salerno, Kentaro Hoffman, Awan Afiaz, Anna Neufeld, Tyler H. McCormick, Jeffrey T. Leek

As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g., rising costs, declining survey response rates), researchers increasingly use predictions from pre-trained algorithms as substitutes for missing or unobserved data. Though appealing for financial and logistical reasons, using standard tools for inference can misrepresent the association between independent variables and the outcome of interest when the true, unobserved outcome is replaced by a predicted value. In this paper, we characterize the statistical challenges inherent to drawing inference with predicted data (IPD) and show that high predictive accuracy does not guarantee valid downstream inference. We show that all such failures reduce to statistical notions of (i) bias, when predictions systematically shift the estimand or distort relationships among variables, and (ii) variance, when uncertainty from the prediction model and the intrinsic variability of the true data are ignored. We then review recent methods for conducting IPD and discuss how this framework is deeply rooted in classical statistical theory. Finally, we comment on some open questions and interesting avenues for future work in this area, and end with some comments on how to use predicted data in scientific studies in a way that is both transparent and statistically principled.

Subject: Machine Learning

Publish: 2025-12-05 06:24:23 UTC


#15 Symmetric Linear Dynamical Systems are Learnable from Few Observations

Authors: Minh Vu, Andrey Y. Lokhov, Marc Vuffray

We consider the problem of learning the parameters of an $N$-dimensional stochastic linear dynamical system under both full and partial observations from a single trajectory of time $T$. We introduce and analyze a new estimator that achieves a small maximum element-wise error on the recovery of symmetric dynamic matrices using only $T=\mathcal{O}(\log N)$ observations, irrespective of whether the matrix is sparse or dense. This estimator is based on the method of moments and does not rely on problem-specific regularization. This is especially important for applications such as structure discovery.

Subjects: Machine Learning , Systems and Control , Optimization and Control

Publish: 2025-12-05 00:33:31 UTC


#16 Optimal Watermark Generation under Type I and Type II Errors

Authors: Hengzhi He, Shirong Xu, Alexander Nemecek, Jiping Li, Erman Ayday, Guang Cheng

Watermarking has recently emerged as a crucial tool for protecting the intellectual property of generative models and for distinguishing AI-generated content from human-generated data. Despite its practical success, most existing watermarking schemes are empirically driven and lack a theoretical understanding of the fundamental trade-off between detection power and generation fidelity. To address this gap, we formulate watermarking as a statistical hypothesis testing problem between a null distribution and its watermarked counterpart. Under explicit constraints on false-positive and false-negative rates, we derive a tight lower bound on the achievable fidelity loss, measured by a general $f$-divergence, and characterize the optimal watermarked distribution that attains this bound. We further develop a corresponding sampling rule that provides an optimal mechanism for inserting watermarks with minimal fidelity distortion. Our result establishes a simple yet broadly applicable principle linking hypothesis testing, information divergence, and watermark generation.

Subject: Methodology

Publish: 2025-12-05 00:22:15 UTC


#17 Identifiability and improper solutions in the probabilistic partial least squares regression with unique variance

Author: Takashi Arai

This paper addresses theoretical issues associated with probabilistic partial least squares (PLS) regression. As in the case of factor analysis, the probabilistic PLS regression with unique variance suffers from the issues of improper solutions and lack of identifiability, both of which cause difficulties in interpreting latent variables and model parameters. Using the fact that the probabilistic PLS regression can be viewed as a special case of factor analysis, we apply a norm constraint prescription on the factor loading matrix in the probabilistic PLS regression, which was recently proposed in the context of factor analysis to avoid improper solutions. Then, we prove that the probabilistic PLS regression with this norm constraint is identifiable. We apply the probabilistic PLS regression to data on amino acid mutations in Human Immunodeficiency Virus (HIV) protease to demonstrate the validity of the norm constraint and to confirm the identifiability numerically. Utilizing the proposed constraint enables the visualization of latent variables via a biplot. We also investigate the sampling distribution of the maximum likelihood estimates (MLE) using synthetically generated data. We numerically observe that MLE is consistent and asymptotically normally distributed.

Subjects: Methodology , Data Analysis, Statistics and Probability

Publish: 2025-12-05 00:11:25 UTC


#18 Does Rerandomization Help Beyond Covariate Adjustment? A Review and Guide for Theory and Practice

Authors: Antônio Carlos Herling Ribeiro Junior, Zach Branson

Rerandomization is a modern experimental design technique that repeatedly randomizes treatment assignments until covariates are deemed balanced between treatment groups. This enhances the precision and coherence of causal effect estimators, mitigates false discoveries from p-hacking, and increases statistical power. Recent work suggests that balancing covariates via rerandomization does not alter the asymptotic precision of covariate-adjusted estimators, thereby making it unclear whether rerandomization is worthwhile if adjusted estimators are used. However, these results have two key caveats. First, these results are asymptotic, leaving finite sample performance unknown. Second, these results focus on precision, while other potential benefits, such as increased coherence among flexible estimators, remain understudied. Hence, in this paper we provide three main contributions: (i) a comprehensive review of the rerandomization literature, covering historical foundations, theoretical developments, and recent methodological advancements, (ii) an extensive simulation study examining finite-sample performance, and (iii) a practical guide for practitioners. Our study compares precision, coherence, power, and coverage of various estimators under rerandomization versus complete randomization. We find rerandomization to be a complementary design strategy that enhances the precision, robustness, and reliability of causal effect estimators, especially for smaller sample sizes.
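The basic accept/reject loop behind rerandomization can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure studied in the paper: it assumes a single covariate, a balanced two-arm design, and a hypothetical acceptance rule based on the absolute standardized mean difference (the literature commonly uses the Mahalanobis distance for multiple covariates). The names `standardized_mean_difference` and `rerandomize`, the threshold, and the covariate values are all assumptions made for illustration.

```python
import random
import statistics


def standardized_mean_difference(x, assignment):
    """Absolute difference in covariate means between arms, in SD units."""
    treated = [xi for xi, a in zip(x, assignment) if a == 1]
    control = [xi for xi, a in zip(x, assignment) if a == 0]
    sd = statistics.pstdev(x)  # SD of the covariate over all units
    return abs(statistics.fmean(treated) - statistics.fmean(control)) / sd


def rerandomize(x, threshold=0.1, max_draws=10_000, seed=0):
    """Redraw a balanced treatment assignment until the covariate is balanced.

    Returns the accepted assignment (1 = treated, 0 = control) and the
    number of draws it took. The threshold is a hypothetical choice.
    """
    rng = random.Random(seed)
    n = len(x)
    base = [1] * (n // 2) + [0] * (n - n // 2)
    for draw in range(1, max_draws + 1):
        assignment = base[:]
        rng.shuffle(assignment)  # one complete randomization
        if standardized_mean_difference(x, assignment) <= threshold:
            return assignment, draw
    raise RuntimeError("no acceptable assignment found within max_draws")


# Hypothetical covariate values for 20 experimental units.
covariate = [float(i) for i in range(20)]
assignment, draws = rerandomize(covariate, threshold=0.1)
print("accepted after", draws, "draw(s):", assignment)
```

A tighter threshold yields better covariate balance at the cost of more draws and a more restricted set of possible assignments, which is one reason inference after rerandomization has to account for the acceptance rule.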

Subject: Methodology

Publish: 2025-12-04 22:27:10 UTC


#19 A Functional Approach to Testing Overall Effect of Interaction Between DNA Methylation and SNPs

Authors: Yvelin Gansou, Karim Oualkacha, Marzia Angela Cremona, Lajmi Lakhal-Chaieb

We introduce a test for the overall effect of interaction between DNA methylation and a set of single nucleotide polymorphisms (SNPs) on a quantitative phenotype. The developed inference procedure is based on a functional approach that extends existing regression models in functional data analysis. Through extensive simulations, we show that the proposed test effectively controls type I error rates and highlights increased empirical power over existing methods, particularly when multiple interactions are present. The use of the proposed test is illustrated with an application to data from obesity patients and controls.

Subjects: Methodology , Applications

Publish: 2025-12-04 21:56:33 UTC


#20 One-Step Diffusion Samplers via Self-Distillation and Deterministic Flow

Authors: Pascal Jutras-Dube, Jiaru Zhang, Ziran Wang, Ruqi Zhang

Sampling from unnormalized target distributions is a fundamental yet challenging task in machine learning and statistics. Existing sampling algorithms typically require many iterative steps to produce high-quality samples, leading to high computational costs. We introduce one-step diffusion samplers which learn a step-conditioned ODE so that one large step reproduces the trajectory of many small ones via a state-space consistency loss. We further show that standard ELBO estimates in diffusion samplers degrade in the few-step regime because common discrete integrators yield mismatched forward/backward transition kernels. Motivated by this analysis, we derive a deterministic-flow (DF) importance weight for ELBO estimation without a backward kernel. To calibrate DF, we introduce a volume-consistency regularization that aligns the accumulated volume change along the flow across step resolutions. Our proposed sampler therefore achieves both sampling and stable evidence estimate in only one or few steps. Across challenging synthetic and Bayesian benchmarks, it achieves competitive sample quality with orders-of-magnitude fewer network evaluations while maintaining robust ELBO estimates.

Subject: Machine Learning

Publish: 2025-12-04 20:57:53 UTC


#21 Exchangeable Gaussian Processes with application to epidemics

Authors: Lampros Bouranis, Petros Barmpounakis, Nikolaos Demiris, Konstantinos Kalogeropoulos

We develop a Bayesian non-parametric framework based on multi-task Gaussian processes, appropriate for temporal shrinkage. We focus on a particular class of dynamic hierarchical models to obtain evidence-based knowledge of infectious disease burden. These models induce a parsimonious way to capture cross-dependence between groups while retaining a natural interpretation based on an underlying mean process, itself expressed as a Gaussian process. We analyse distinct types of outbreak data from recent epidemics and find that the proposed models result in improved predictive ability against competing alternatives.

Subjects: Methodology , Applications , Computation

Publish: 2025-12-04 20:01:41 UTC


#22 How to Tame Your LLM: Semantic Collapse in Continuous Systems

Author: C. M. Wyss

We develop a general theory of semantic dynamics for large language models by formalizing them as Continuous State Machines (CSMs): smooth dynamical systems whose latent manifolds evolve under probabilistic transition operators. The associated transfer operator $P: L^2(M,μ) \to L^2(M,μ)$ encodes the propagation of semantic mass. Under mild regularity assumptions (compactness, ergodicity, bounded Jacobian), $P$ is compact with discrete spectrum. Within this setting, we prove the Semantic Characterization Theorem (SCT): the leading eigenfunctions of $P$ induce finitely many spectral basins of invariant meaning, each definable in an o-minimal structure over $\mathbb{R}$. Thus spectral lumpability and logical tameness coincide. This explains how discrete symbolic semantics can emerge from continuous computation: the continuous activation manifold collapses into a finite, logically interpretable ontology. We further extend the SCT to stochastic and adiabatic (time-inhomogeneous) settings, showing that slowly drifting kernels preserve compactness, spectral coherence, and basin structure.

Subjects: Machine Learning , Artificial Intelligence , Dynamical Systems , Probability

Publish: 2025-12-04 11:33:02 UTC


#23 Developing synthetic microdata through machine learning for firm-level business surveys

Authors: Jorge Cisneros Paz, Timothy Wojan, Matthew Williams, Jennifer Ozawa, Robert Chew, Kimberly Janda, Timothy Navarro, Michael Floyd, Christine Task, Damon Streat

Public-use microdata samples (PUMS) from the United States (US) Census Bureau on individuals have been available for decades. However, large increases in computing power and the greater availability of Big Data have dramatically increased the probability of re-identifying anonymized data, potentially violating the pledge of confidentiality given to survey respondents. Data science tools can be used to produce synthetic data that preserve critical moments of the empirical data but do not contain the records of any existing individual respondent or business. Developing public-use firm data from surveys presents unique challenges different from demographic data, because there is a lack of anonymity and certain industries can be easily identified in each geographic area. This paper briefly describes a machine learning model used to construct a synthetic PUMS based on the Annual Business Survey (ABS) and discusses various quality metrics. Although the ABS PUMS is currently being refined and results are confidential, we present two synthetic PUMS developed for the 2007 Survey of Business Owners, similar to the ABS business data. Econometric replication of a high impact analysis published in Small Business Economics demonstrates the verisimilitude of the synthetic data to the true data and motivates discussion of possible ABS use cases.

Subjects: Machine Learning , General Economics , Applications , Methodology

Publish: 2025-12-05 18:44:30 UTC


#24 On the Bayes Inconsistency of Disagreement Discrepancy Surrogates

Authors: Neil G. Marchant, Andrew C. Cullen, Feng Liu, Sarah M. Erfani

Deep neural networks often fail when deployed in real-world contexts due to distribution shift, a critical barrier to building safe and reliable systems. An emerging approach to address this problem relies on disagreement discrepancy, a measure of how the disagreement between two models changes under a shifting distribution. The process of maximizing this measure has seen applications in bounding error under shifts, testing for harmful shifts, and training more robust models. However, this optimization involves the non-differentiable zero-one loss, necessitating the use of practical surrogate losses. We prove that existing surrogates for disagreement discrepancy are not Bayes consistent, revealing a fundamental flaw: maximizing these surrogates can fail to maximize the true disagreement discrepancy. To address this, we introduce new theoretical results providing both upper and lower bounds on the optimality gap for such surrogates. Guided by this theory, we propose a novel disagreement loss that, when paired with cross-entropy, yields a provably consistent surrogate for disagreement discrepancy. Empirical evaluations across diverse benchmarks demonstrate that our method provides more accurate and robust estimates of disagreement discrepancy than existing approaches, particularly under challenging adversarial conditions.

Subject: Machine Learning

Publish: 2025-12-05 18:16:03 UTC


#25 A Residual Variance Matching Recursive Least Squares Filter for Real-time UAV Terrain Following

Authors: Xiaobo Wu, Youmin Zhang

Accurate real-time waypoint estimation for UAV-based online terrain following during wildfire patrol missions is critical to ensuring flight safety and enabling wildfire detection. However, existing real-time filtering algorithms struggle to maintain accurate waypoints under measurement noise in nonlinear and time-varying systems, posing risks of flight instability and missed wildfire detections during UAV-based terrain following. To address this issue, a Residual Variance Matching Recursive Least Squares (RVM-RLS) filter, guided by a Residual Variance Matching Estimation (RVME) criterion, is proposed to adaptively estimate the real-time waypoints of nonlinear, time-varying UAV-based terrain following systems. The proposed method is validated using a UAV-based online terrain following system within a simulated terrain environment. Experimental results show that the RVM-RLS filter improves waypoint estimation accuracy by approximately 88% compared with benchmark algorithms across multiple evaluation metrics. These findings demonstrate both the methodological advances in real-time filtering and the practical potential of the RVM-RLS filter for UAV-based online wildfire patrol.

Subjects: Signal Processing , Robotics , Machine Learning

Publish: 2025-12-05 17:55:32 UTC