Statistics

2025-05-15 | Total: 36

#1 Adaptively-weighted Nearest Neighbors for Matrix Completion

Authors: Tathagata Sadhukhan, Manit Paul, Raaz Dwivedi

In this technical note, we introduce and analyze AWNN: an adaptively weighted nearest neighbor method for performing matrix completion. Nearest neighbor (NN) methods are widely used in missing data problems across multiple disciplines, such as in recommender systems and for performing counterfactual inference in panel data settings. Prior works have shown that, in addition to being very intuitive and easy to implement, NN methods enjoy nice theoretical guarantees. However, the performance of the majority of NN methods relies on the appropriate choice of the radii and the weights assigned to each member of the nearest neighbor set, and despite several works on nearest neighbor methods in the past two decades, there does not exist a systematic approach to choosing the radii and the weights without relying on methods like cross-validation. AWNN addresses this challenge by judiciously balancing the bias-variance trade-off inherent in weighted nearest-neighbor regression. We provide theoretical guarantees for the proposed method under minimal assumptions and support the theory via synthetic experiments.

Subjects: Machine Learning, Machine Learning, Statistics Theory, Methodology

Publish: 2025-05-14 17:59:17 UTC


#2 Robust Representation and Estimation of Barycenters and Modes of Probability Measures on Metric Spaces

Authors: Washington Mio, Tom Needham

This paper is concerned with the problem of defining and estimating statistics for distributions on spaces such as Riemannian manifolds and more general metric spaces. The challenge comes, in part, from the fact that statistics such as means and modes may be unstable: for example, a small perturbation to a distribution can lead to a large change in Fréchet means on spaces as simple as a circle. We address this issue by introducing a new merge tree representation of barycenters called the barycentric merge tree (BMT), which takes the form of a measured metric graph and summarizes features of the distribution in a multiscale manner. Modes are treated as special cases of barycenters through diffusion distances. In contrast to the properties of classical means and modes, we prove that BMTs are stable -- this is quantified as a Lipschitz estimate involving optimal transport metrics. This stability allows us to derive a consistency result for approximating BMTs from empirical measures, with explicit convergence rates. We also give a provably accurate method for discretely approximating the BMT construction and use this to provide numerical examples for distributions on spheres and shape spaces.

Subjects: Statistics Theory, Metric Geometry, Methodology

Publish: 2025-05-14 17:58:22 UTC


#3 Design of Experiments for Emulations: A Selective Review from a Modeling Perspective

Authors: Xinwei Deng, Lulu Kang, C. Devon Lin

Space-filling designs are crucial for efficient computer experiments, enabling accurate surrogate modeling and uncertainty quantification in many scientific and engineering applications, such as digital twin systems and cyber-physical systems. In this work, we provide a comprehensive review of key design methodologies, including Maximin/miniMax designs, Latin hypercubes, and projection-based designs. Moreover, we connect space-filling design criteria such as the fill distance to Gaussian process performance. Numerical studies are conducted to investigate the practical trade-offs among various design types, with a discussion of emerging challenges in high-dimensional and constrained settings. The paper concludes with future directions in adaptive sampling and machine learning integration, providing guidance for improving computational experiments.

Subject: Methodology

Publish: 2025-05-14 17:44:51 UTC
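
To make the Latin hypercube and maximin ideas named in the abstract of #3 concrete, here is a minimal sketch. It is not taken from the paper: it assumes a plain random Latin hypercube on the unit cube and a naive best-of-many search under the maximin criterion, and the function names `latin_hypercube`, `min_pairwise_distance`, and `best_of_random_lhs` are hypothetical.

```python
import numpy as np


def latin_hypercube(n, d, rng):
    """Random Latin hypercube: n points in [0, 1)^d, one per stratum in each dimension."""
    # Each column is a random permutation of the strata 0..n-1.
    strata = np.column_stack([rng.permutation(n) for _ in range(d)])
    # Jitter every point uniformly inside its stratum, then rescale to the unit cube.
    return (strata + rng.random((n, d))) / n


def min_pairwise_distance(points):
    """Maximin criterion value: the smallest Euclidean distance between two distinct points."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return float(dist[np.triu_indices(len(points), k=1)].min())


def best_of_random_lhs(n, d, tries, seed=0):
    """Naive search (an assumption of this sketch): keep the random design with the largest minimum distance."""
    rng = np.random.default_rng(seed)
    candidates = [latin_hypercube(n, d, rng) for _ in range(tries)]
    return max(candidates, key=min_pairwise_distance)


design = best_of_random_lhs(n=10, d=2, tries=50)
print(design.shape, min_pairwise_distance(design))
```

Every candidate keeps the one-point-per-stratum property in each coordinate, so the search only trades off how well the points are spread in the full space.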


#4 Scalable Computations for Generalized Mixed Effects Models with Crossed Random Effects Using Krylov Subspace Methods

Authors: Pascal Kündig, Fabio Sigrist

Mixed effects models are widely used for modeling data with hierarchically grouped structures and high-cardinality categorical predictor variables. However, for high-dimensional crossed random effects, current standard computations relying on Cholesky decompositions can become prohibitively slow. In this work, we present novel Krylov subspace-based methods that address several existing computational bottlenecks. Among other things, we theoretically analyze and empirically evaluate various preconditioners for the conjugate gradient and stochastic Lanczos quadrature methods, derive new convergence results, and develop computationally efficient methods for calculating predictive variances. Extensive experiments using simulated and real-world data sets show that our proposed methods scale much better than Cholesky-based computations, for instance, achieving a runtime reduction of approximately two orders of magnitude for both estimation and prediction. Moreover, our software implementation is up to 10,000 times faster and more stable than state-of-the-art implementations such as lme4 and glmmTMB when using default settings. Our methods are implemented in the free C++ software library GPBoost with high-level Python and R packages.

Subjects: Methodology, Machine Learning, Machine Learning

Publish: 2025-05-14 16:50:19 UTC


#5 Depth-Based Local Center Clustering: A Framework for Handling Different Clustering Scenarios

Authors: Siyi Wang, Alexandre Leblanc, Paul D. McNicholas

Cluster analysis, or clustering, plays a crucial role across numerous scientific and engineering domains. Despite the wealth of clustering methods proposed over the past decades, each method is typically designed for specific scenarios and presents certain limitations in practical applications. In this paper, we propose depth-based local center clustering (DLCC). This novel method makes use of data depth, which is known to produce a center-outward ordering of sample points in a multivariate space. However, data depth typically fails to capture the multimodal characteristics of data, something of the utmost importance in the context of clustering. To overcome this, DLCC makes use of a local version of data depth that is based on subsets of data. From this, local centers can be identified as well as clusters of varying shapes. Furthermore, we propose a new internal metric based on density-based clustering to evaluate clustering performance on non-convex clusters. Overall, DLCC is a flexible clustering approach that seems to overcome some limitations of traditional clustering methods, thereby enhancing data analysis capabilities across a wide range of application scenarios.

Subjects: Methodology, Machine Learning, Applications

Publish: 2025-05-14 16:08:11 UTC


#6 Deep-SITAR: A SITAR-Based Deep Learning Framework for Growth Curve Modeling via Autoencoders

Authors: María Alejandra Hernández, Oscar Rodriguez, Dae-Jin Lee

Several approaches have been developed to capture the complexity and nonlinearity of human growth. One widely used approach is the Super Imposition by Translation and Rotation (SITAR) model, which has become popular in studies of adolescent growth. SITAR is a shape-invariant mixed-effects model that represents the shared growth pattern of a population using a natural cubic spline mean curve while incorporating three subject-specific random effects -- timing, size, and growth intensity -- to account for variations among individuals. In this work, we introduce a supervised deep learning framework based on an autoencoder architecture that integrates a deep neural network with a B-spline model to estimate the SITAR model. In this approach, the encoder estimates the random effects for each individual, while the decoder performs a fitting based on B-splines similar to the classic SITAR model. We refer to this method as the Deep-SITAR model. This innovative approach enables the prediction of the random effects of new individuals entering a population without requiring a full model re-estimation. As a result, Deep-SITAR offers a powerful approach to predicting growth trajectories, combining the flexibility and efficiency of deep learning with the interpretability of traditional mixed-effects models.

Subjects: Machine Learning, Machine Learning

Publish: 2025-05-14 15:55:16 UTC


#7 Reinforcement Learning for Individual Optimal Policy from Heterogeneous Data

Authors: Rui Miao, Babak Shahbaba, Annie Qu

Offline reinforcement learning (RL) aims to find optimal policies in dynamic environments in order to maximize the expected total rewards by leveraging pre-collected data. Learning from heterogeneous data is one of the fundamental challenges in offline RL. Traditional methods focus on learning an optimal policy for all individuals with pre-collected data from a single episode or homogeneous batch episodes, and thus, may result in a suboptimal policy for a heterogeneous population. In this paper, we propose an individualized offline policy optimization framework for heterogeneous time-stationary Markov decision processes (MDPs). The proposed heterogeneous model with individual latent variables enables us to efficiently estimate the individual Q-functions, and our Penalized Pessimistic Personalized Policy Learning (P4L) algorithm guarantees a fast rate on the average regret under a weak partial coverage assumption on behavior policies. In addition, our simulation studies and a real data application demonstrate the superior numerical performance of the proposed method compared with existing methods.

Subjects: Machine Learning, Machine Learning

Publish: 2025-05-14 15:44:10 UTC


#8 Fairness-aware Bayes optimal functional classification

Authors: Xiaoyu Hu, Gengyu Xue, Zhenhua Lin, Yi Yu

Algorithmic fairness has become a central topic in machine learning, and mitigating disparities across different subpopulations has emerged as a rapidly growing research area. In this paper, we systematically study the classification of functional data under fairness constraints, ensuring the disparity level of the classifier is controlled below a pre-specified threshold. We propose a unified framework for fairness-aware functional classification, tackling an infinite-dimensional functional space, addressing key challenges from the absence of density ratios and intractability of posterior probabilities, and discussing unique phenomena in functional classification. We further design a post-processing algorithm, the Fair Functional Linear Discriminant Analysis classifier (Fair-FLDA), which targets homoscedastic Gaussian processes and achieves fairness via group-wise thresholding. Under weak structural assumptions on the eigenspace, theoretical guarantees on fairness and excess risk controls are established. As a byproduct, our results cover the excess risk control of the standard FLDA as a special case, which, to the best of our knowledge, is established here for the first time. Our theoretical findings are complemented by extensive numerical experiments on synthetic and real datasets, highlighting the practicality of our designed algorithm.

Subjects: Machine Learning, Machine Learning, Statistics Theory, Methodology

Publish: 2025-05-14 15:22:09 UTC


#9 A Bayesian Treatment Selection Design for Phase II Randomised Cancer Clinical Trials

Authors: Moka Komaki, Satoru Shinoda, Haiyan Zheng, Kouji Yamamoto

It is crucial to design Phase II cancer clinical trials that balance the efficiency of treatment selection with clinical practicality. Sargent and Goldberg proposed a frequentist design that allows decision-making even when the primary endpoint is ambiguous. However, frequentist approaches rely on fixed thresholds and long-run frequency properties, which can limit flexibility in practical applications. In contrast, the Bayesian decision rule, based on posterior probabilities, enables transparent decision-making by incorporating prior knowledge and updating beliefs with new data, addressing some of the inherent limitations of frequentist designs. In this study, we propose a novel Bayesian design, allowing selection of a best-performing treatment. Specifically, concerning phase II clinical trials with a binary outcome, our decision rule employs posterior interval probability by integrating the joint distribution over all values for which the 'success rate' of the best-performing treatment is greater than that of the other(s). This design can then determine which treatment should proceed to the next phase, given predefined decision thresholds. Furthermore, we propose two sample size determination methods to empower such treatment selection designs implemented in a Bayesian framework. Through simulation studies and real-data applications, we demonstrate how this approach can overcome challenges related to sample size constraints in randomised trials. In addition, we present a user-friendly R Shiny application, enabling clinicians to use Bayesian designs. Both our methodology and the software application can advance the design and analysis of clinical trials for evaluating cancer treatments.

Subjects: Methodology, Applications

Publish: 2025-05-14 15:10:50 UTC


#10 Independent Component Analysis by Robust Distance Correlation

Authors: Sarah Leyder, Jakob Raymaekers, Peter J. Rousseeuw, Tom Van Deuren, Tim Verdonck

Independent component analysis (ICA) is a powerful tool for decomposing a multivariate signal or distribution into fully independent sources, not just uncorrelated ones. Unfortunately, most approaches to ICA are not robust against outliers. Here we propose a robust ICA method called RICA, which estimates the components by minimizing a robust measure of dependence between multivariate random variables. The dependence measure used is the distance correlation (dCor). In order to make it more robust we first apply a new transformation called the bowl transform, which is bounded, one-to-one, continuous, and maps far outliers to points close to the origin. This preserves the crucial property that a zero dCor implies independence. RICA estimates the independent sources sequentially, by looking for the component that has the smallest dCor with the remainder. RICA is strongly consistent and has the usual parametric rate of convergence. Its robustness is investigated by a simulation study, in which it generally outperforms its competitors. The method is illustrated on three applications, including the well-known cocktail party problem.

Subjects: Computation, Machine Learning

Publish: 2025-05-14 14:25:43 UTC
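
The dependence measure named in the abstract of #10, distance correlation (dCor), can be computed from a sample as sketched below. This is the classical sample version built from double-centered pairwise distance matrices, not the RICA method itself: the bowl transform and the sequential component search are not reproduced, and the function names are hypothetical.

```python
import numpy as np


def _centered_distances(x):
    """Pairwise Euclidean distance matrix of the rows of x, double-centered."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n observations of one variable
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    # Subtract row and column means, add back the grand mean.
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation between paired observations x and y (value in [0, 1])."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = (a * b).mean()                            # squared sample distance covariance
    dvar2 = np.sqrt((a * a).mean() * (b * b).mean())  # product of the two distance standard deviations
    if dvar2 == 0:
        return 0.0  # a constant sample has no dependence to measure
    return float(np.sqrt(max(dcov2, 0.0) / dvar2))


rng = np.random.default_rng(0)
x = rng.normal(size=200)
print(distance_correlation(x, 2 * x + 1))             # exact linear relation
print(distance_correlation(x, rng.normal(size=200)))  # independent draws
```

For an exact linear relation the two centered distance matrices are proportional, so the sample dCor equals 1 up to rounding; for independent draws it is small but not exactly zero in finite samples.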


#11 Semiparametric marginal promotion time cure model for clustered survival data

Authors: Fei Xiao, Yingwei Peng, Dipankar Bandyopadhyay, Yi Niu

Modeling clustered/correlated failure time data has become increasingly important in clinical trials and epidemiology studies. In this paper, we consider a semiparametric marginal promotion time cure model for clustered right-censored survival data with a cure fraction. We propose two estimation methods based on the generalized estimating equations and the quadratic inference functions and prove that the regression estimates from the two proposed methods are consistent and asymptotically normal and that the estimates from the quadratic inference functions are optimal. The simulation study shows that the estimates from both methods are more efficient than those from the existing method, regardless of whether the correlation structure is correctly specified. The estimates based on the quadratic inference functions achieve higher efficiency compared with those based on the generalized estimating equations under the same working correlation structure. An application of the proposed methods is demonstrated with periodontal disease data and new findings are revealed in the analysis.

Subject: Methodology

Publish: 2025-05-14 09:36:16 UTC


#12 Optimal Transport-Based Domain Adaptation for Rotated Linear Regression

Authors: Brian Britos, Mathias Bourel

Optimal Transport (OT) has proven effective for domain adaptation (DA) by aligning distributions across domains with differing statistical properties. Building on the approach of Courty et al. (2016), who mapped source data to the target domain for improved model transfer, we focus on a supervised DA problem involving linear regression models under rotational shifts. This ongoing work considers cases where source and target domains are related by a rotation, which is common in applications like sensor calibration or image orientation. We show that in $\mathbb{R}^2$, when using a p-norm cost with $p \ge 2$, the optimal transport map recovers the underlying rotation. Based on this, we propose an algorithm that combines K-means clustering, OT, and singular value decomposition (SVD) to estimate the rotation angle and adapt the regression model. This method is particularly effective when the target domain is sparsely sampled, leveraging abundant source data for improved generalization. Our contributions offer both theoretical and practical insights into OT-based model adaptation under geometric transformations.

Subjects: Machine Learning, Machine Learning, Probability

Publish: 2025-05-14 09:06:40 UTC
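
As a rough illustration of the SVD step mentioned in the abstract of #12, the sketch below recovers a 2-D rotation from point pairs that are assumed to be matched already, using the standard orthogonal Procrustes (Kabsch) solution. It is not the authors' algorithm: the K-means and OT matching stages are skipped, and the names `estimate_rotation` and `rotation_angle` are hypothetical.

```python
import numpy as np


def estimate_rotation(source, target):
    """Least-squares rotation R with target[i] ~ R @ source[i], for row-paired 2-D points."""
    # Cross-covariance of the pairs; no centering, since a pure rotation about the origin is assumed.
    h = source.T @ target
    u, _, vt = np.linalg.svd(h)
    # Force det(R) = +1 so the result is a rotation rather than a reflection.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    return vt.T @ np.diag([1.0, d]) @ u.T


def rotation_angle(r):
    """Angle in radians of a 2x2 rotation matrix."""
    return float(np.arctan2(r[1, 0], r[0, 0]))


rng = np.random.default_rng(1)
theta = 0.7
r_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta), np.cos(theta)]])
source = rng.normal(size=(50, 2))
target = source @ r_true.T  # every target point is the rotated source point
print(rotation_angle(estimate_rotation(source, target)))
```

In practice the pairing would have to come from somewhere (in the abstract, from OT between cluster centers); with noisy or mismatched pairs the same formula returns the best-fitting rotation in the least-squares sense.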


#13 Generalizing imaging biomarker repeatability studies using Bayesian inference: Applications in detecting heterogeneous treatment response in whole-body diffusion-weighted MRI of metastatic prostate cancer

Authors: Matthew D Blackledge, Konstantinos Zormpas-Petridis, Ricardo Donners, Antonio Candito, David J Collins, Johann de Bono, Chris Parker, Dow-Mu Koh, Nina Tunariu

The assessment of imaging biomarkers is critical for advancing precision medicine and improving disease characterization. Despite the availability of methods to derive disease heterogeneity metrics in imaging studies, a robust framework for evaluating measurement uncertainty remains underdeveloped. To address this gap, we propose a novel Bayesian framework to assess the precision of disease heterogeneity measures in biomarker studies. Our approach extends traditional methods for evaluating biomarker precision by providing greater flexibility in statistical assumptions and enabling the analysis of biomarkers beyond univariate or multivariate normally distributed variables. Using Hamiltonian Monte Carlo sampling, the framework supports, for example, both normally distributed and Dirichlet-multinomial distributed variables, enabling the derivation of posterior distributions for biomarker parameters under diverse model assumptions. Designed to be broadly applicable across various imaging modalities and biomarker types, the framework builds a foundation for generalizing reproducible and objective biomarker evaluation. To demonstrate utility, we apply the framework to whole-body diffusion-weighted MRI (WBDWI) to assess heterogeneous therapeutic responses in metastatic bone disease. Specifically, we analyze data from two patient studies investigating treatments for metastatic castrate-resistant prostate cancer (mCRPC). Our results reveal an approximately 70% response rate among individual tumors across both studies, objectively characterizing differential responses to systemic therapies and validating the clinical relevance of the proposed methodology. This Bayesian framework provides a powerful tool for advancing biomarker research across diverse imaging-based studies while offering valuable insights into specific clinical applications, such as mCRPC treatment response.

Subjects: Applications, Image and Video Processing

Publish: 2025-05-14 07:18:06 UTC


#14 Online Learning of Neural Networks

Authors: Amit Daniely, Idan Mehalel, Elchanan Mossel

We study online learning of feedforward neural networks with the sign activation function that implement functions from the unit ball in $\mathbb{R}^d$ to a finite label set $\{1, \ldots, Y\}$. First, we characterize a margin condition that is sufficient and in some cases necessary for online learnability of a neural network: Every neuron in the first hidden layer classifies all instances with some margin $\gamma$ bounded away from zero. Quantitatively, we prove that for any net, the optimal mistake bound is at most approximately $\mathtt{TS}(d,\gamma)$, which is the $(d,\gamma)$-totally-separable-packing number, a more restricted variation of the standard $(d,\gamma)$-packing number. We complement this result by constructing a net on which any learner makes $\mathtt{TS}(d,\gamma)$ many mistakes. We also give a quantitative lower bound of approximately $\mathtt{TS}(d,\gamma) \geq \max\{1/(\gamma \sqrt{d})^d, d\}$ when $\gamma \geq 1/2$, implying that for some nets and input sequences every learner will err $\exp(d)$ many times, and that a dimension-free mistake bound is almost always impossible. To remedy this inevitable dependence on $d$, it is natural to seek additional restrictions on the network that remove the dependence on $d$. We study two such restrictions. The first is the multi-index model, in which the function computed by the net depends only on $k \ll d$ orthonormal directions. We prove a mistake bound of approximately $(1.5/\gamma)^{k + 2}$ in this model. The second is the extended margin assumption. In this setting, we assume that all neurons (in all layers) in the network classify every incoming input from the previous layer with margin $\gamma$ bounded away from zero. In this model, we prove a mistake bound of approximately $(\log Y)/ \gamma^{O(L)}$, where $L$ is the depth of the network.

Subjects: Machine Learning, Machine Learning

Publish: 2025-05-14 06:03:07 UTC


#15 Nelson-Aalen kernel estimator to the tail index of right censored Pareto-type data

Authors: Nour Elhouda Guesmia, Abdelhakim Necir, Djamel Meraghni

On the basis of the Nelson-Aalen product-limit estimator of a randomly censored distribution function, we introduce a kernel estimator of the tail index of right-censored Pareto-like data. Under some regularity assumptions, the consistency and asymptotic normality of the proposed estimator are established. A small simulation study shows that the proposed estimator performs much better, in terms of bias and stability, than the existing ones, with a slight increase in the mean squared error. The results are applied to insurance loss data to illustrate the practical effectiveness of our estimator.

Subject: Statistics Theory

Publish: 2025-05-14 05:25:06 UTC


#16 Model-free High Dimensional Mediator Selection with False Discovery Rate Control

Authors: Runqiu Wang, Ran Dai, Jieqiong Wang, Charlie Soh, Ziyang Xu, Mohamed Azzam, Cheng Zheng

Selecting high-dimensional mediators is challenging when the mediators have complex correlation structures and interactions. In this work, we frame the high-dimensional mediator selection problem as a series of hypothesis tests with composite nulls, and develop a method to control the false discovery rate (FDR) under mild assumptions on the mediation model. We show the theoretical guarantee that the proposed method and algorithm achieve FDR control. We present extensive simulation results to demonstrate the power and finite sample performance compared with existing methods. Lastly, we demonstrate the method by analyzing the Alzheimer's Disease Neuroimaging Initiative (ADNI) data, in which the proposed method selects the volume of the hippocampus and amygdala, as well as some other important MRI-derived measures, as mediators for the relationship between gender and dementia progression.

Subject: Methodology

Publish: 2025-05-14 03:21:37 UTC


#17 Sequential Scoring Rule Evaluation for Forecast Method Selection

Authors: David T. Frazier, Donald S. Poskitt

This paper shows that sequential statistical analysis techniques can be generalised to the problem of selecting between alternative forecasting methods using scoring rules. A return to basic principles is necessary in order to show that ideas and concepts from sequential statistical methods can be adapted and applied to sequential scoring rule evaluation (SSRE). One key technical contribution of this paper is the development of a large deviations type result for SSRE schemes using a change of measure that parallels a traditional exponential tilting form. Further, we also show that SSRE will terminate in finite time with probability one, and that the moments of the SSRE stopping time exist. A second key contribution is to show that the exponential tilting form underlying our large deviations result allows us to cast SSRE within the framework of generalised e-values. Relying on this formulation, we devise sequential testing approaches that are both powerful and maintain control on error probabilities underlying the analysis. Through several simulated examples, we demonstrate that our e-values based SSRE approach delivers reliable results that are more powerful than more commonly applied testing methods precisely in the situations where these commonly applied methods can be expected to fail.

Subjects: Statistics Theory, Econometrics, Methodology

Publish: 2025-05-14 02:52:15 UTC


#18 Risk Bounds For Distributional Regression

Authors: Carlos Misael Madrid Padilla, Oscar Hernan Madrid Padilla, Sabyasachi Chatterjee

This work examines risk bounds for nonparametric distributional regression estimators. For convex-constrained distributional regression, general upper bounds are established for the continuous ranked probability score (CRPS) and the worst-case mean squared error (MSE) across the domain. These theoretical results are applied to isotonic and trend filtering distributional regression, yielding convergence rates consistent with those for mean estimation. Furthermore, a general upper bound is derived for distributional regression under non-convex constraints, with a specific application to neural network-based estimators. Comprehensive experiments on both simulated and real data validate the theoretical contributions, demonstrating their practical effectiveness.

Subjects: Machine Learning, Machine Learning

Publish: 2025-05-14 02:22:12 UTC
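
The continuous ranked probability score (CRPS) named in the abstract of #18 can be evaluated for a forecast given as a finite sample. The sketch below uses the well-known identity CRPS(F, y) = E|X - y| - 0.5 E|X - X'| with the empirical distribution of the sample standing in for F. It is a generic illustration, not the estimators analyzed in the paper, and `crps_from_samples` is a hypothetical name.

```python
import numpy as np


def crps_from_samples(samples, y):
    """CRPS of the empirical distribution of `samples` against the observation y (lower is better)."""
    samples = np.asarray(samples, dtype=float)
    # E|X - y|: average distance from the forecast draws to the observation.
    term_obs = np.abs(samples - y).mean()
    # 0.5 * E|X - X'|: half the average distance between two forecast draws.
    term_spread = 0.5 * np.abs(samples[:, None] - samples[None, :]).mean()
    return float(term_obs - term_spread)


print(crps_from_samples([2.0, 2.0, 2.0], 5.0))  # point forecast: reduces to absolute error
print(crps_from_samples([0.0, 1.0], 0.5))
```

For a degenerate (point) forecast the spread term vanishes and the score reduces to the absolute error, which is why CRPS is often described as a generalization of mean absolute error to distributional forecasts.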


#19 Exploratory Hierarchical Factor Analysis with an Application to Psychological Measurement

Authors: Jiawei Qiao, Yunxiao Chen, Zhiliang Ying

Hierarchical factor models, which include the bifactor model as a special case, are useful in social and behavioural sciences for measuring hierarchically structured constructs. Specifying a hierarchical factor model involves imposing hierarchically structured zero constraints on a factor loading matrix, which is a demanding task that can result in misspecification. Therefore, an exploratory analysis is often needed to learn the hierarchical factor structure from data. Unfortunately, we lack an identifiability theory for the learnability of this hierarchical structure and a computationally efficient method with provable performance. The method of Schmid-Leiman transformation, which is often regarded as the default method for exploratory hierarchical factor analysis, is flawed and likely to fail. The contribution of this paper is three-fold. First, an identifiability result is established for general hierarchical factor models, which shows that the hierarchical factor structure is learnable under mild regularity conditions. Second, a computationally efficient divide-and-conquer approach is proposed for learning the hierarchical factor structure. This approach has two building blocks: (1) a constraint-based continuous optimisation algorithm and (2) a search algorithm based on an information criterion, which together explore the structure of factors nested within a given factor. Finally, asymptotic theory is established for the proposed method, showing that it can consistently recover the true hierarchical factor structure as the sample size grows to infinity. The power of the proposed method is shown via simulation studies and a real data application to a personality test. The computation code for the proposed method is publicly available at https://anonymous.4open.science/r/Exact-Exploratory-Hierarchical-Factor-Analysis-F850.

Subjects: Methodology, Statistics Theory

Publish: 2025-05-14 00:46:58 UTC


#20 Probabilistic Wind Power Forecasting via Non-Stationary Gaussian Processes

Authors: Domniki Ladopoulou, Dat Minh Hong, Petros Dellaportas

Accurate probabilistic forecasting of wind power is essential for maintaining grid stability and enabling efficient integration of renewable energy sources. Gaussian Process (GP) models offer a principled framework for quantifying uncertainty; however, conventional approaches rely on stationary kernels, which are inadequate for modeling the inherently non-stationary nature of wind speed and power output. We propose a non-stationary GP framework that incorporates the generalized spectral mixture (GSM) kernel, enabling the model to capture time-varying patterns and heteroscedastic behaviors in wind speed and wind power data. We evaluate the performance of the proposed model on real-world SCADA data across short-, medium-, and long-term forecasting horizons. Compared to standard radial basis function and spectral mixture kernels, the GSM-based model performs better, particularly in short-term forecasts. These results highlight the necessity of modeling non-stationarity in wind power forecasting and demonstrate the practical value of non-stationary GP models in operational settings.

Subjects: Applications, Machine Learning, Machine Learning

Publish: 2025-05-13 23:46:33 UTC


#21 Lower Bounds on the MMSE of Adversarially Inferring Sensitive Features

Authors: Monica Welfert, Nathan Stromberg, Mario Diaz, Lalitha Sankar

We propose an adversarial evaluation framework for sensitive feature inference based on minimum mean-squared error (MMSE) estimation with a finite sample size and linear predictive models. Our approach establishes theoretical lower bounds on the true MMSE of inferring sensitive features from noisy observations of other correlated features. These bounds are expressed in terms of the empirical MMSE under a restricted hypothesis class and a non-negative error term. The error term captures both the estimation error due to a finite number of samples and the approximation error from using a restricted hypothesis class. For linear predictive models, we derive closed-form bounds, which are order optimal in terms of the noise variance, on the approximation error for several classes of relationships between the sensitive and non-sensitive features, including linear mappings, binary symmetric channels, and class-conditional multivariate Gaussian distributions. We also present a new lower bound that relies on the MSE computed on a hold-out validation dataset of the MMSE estimator learned on finite samples and a restricted hypothesis class. Through empirical evaluation, we demonstrate that our framework serves as an effective tool for MMSE-based adversarial evaluation of sensitive feature inference that balances theoretical guarantees with practical efficiency.

Subjects: Machine Learning, Machine Learning

Publish: 2025-05-13 22:39:24 UTC


#22 Causal Feedback Discovery using Convergence Cross Mapping from Sea Ice Data

Authors: Francis Nji, Seraj Al Mahmud Mostafa, Jianwu Wang

The Arctic region is experiencing accelerated warming, largely driven by complex and nonlinear interactions among time series atmospheric variables such as sea ice extent, short-wave radiation, temperature, and humidity. These interactions significantly alter sea ice dynamics and atmospheric conditions, leading to increased sea ice loss. This loss further intensifies Arctic amplification and disrupts weather patterns through various feedback mechanisms. Although stochastic methods such as Granger causality, PCMCI, and VarLiNGAM estimate causal interactions among atmospheric variables, they are limited to unidirectional causal relationships and often miss weak causal interactions and feedback loops in nonlinear settings. In this study, we show that Convergent Cross Mapping (CCM) can effectively estimate nonlinear causal coupling and identify weak interactions and causal feedback loops among atmospheric variables. CCM employs state space reconstruction (SSR), which makes it suitable for complex nonlinear dynamic systems. While CCM has been successfully applied to a diverse range of systems, including fisheries and online social networks, its application in climate science is under-explored. Our results show that CCM effectively uncovers strong nonlinear causal feedback loops and weak causal interactions often overlooked by stochastic methods in complex nonlinear dynamic atmospheric systems.

Subjects: Applications , Atmospheric and Oceanic Physics

Publish: 2025-05-13 22:36:07 UTC
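
The cross-mapping step behind CCM, as described in the abstract above, can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: it uses a pair of synthetic coupled logistic maps instead of sea ice data, and the embedding dimension, lag, coupling strengths, and function names are all assumptions chosen for the example. The shadow manifold of one series is built by delay embedding, and the other series is estimated from its nearest neighbors on that manifold; the correlation between the estimate and the truth is the cross-map skill.

```python
import numpy as np

def delay_embed(series, dim, tau):
    """Rows are delay vectors [x(t), x(t - tau), ..., x(t - (dim - 1) * tau)]."""
    start = (dim - 1) * tau
    return np.column_stack(
        [series[start - k * tau : len(series) - k * tau] for k in range(dim)]
    )

def cross_map_skill(source, target, dim=2, tau=1):
    """Correlation between `target` and its estimate from the shadow
    manifold of `source` (simplex-style weighting of dim + 1 neighbors)."""
    manifold = delay_embed(np.asarray(source, dtype=float), dim, tau)
    aligned = np.asarray(target, dtype=float)[(dim - 1) * tau :]
    dists = np.linalg.norm(manifold[:, None, :] - manifold[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point may not be its own neighbor
    nn = np.argsort(dists, axis=1)[:, : dim + 1]
    nn_dist = np.take_along_axis(dists, nn, axis=1)
    weights = np.exp(-nn_dist / np.maximum(nn_dist[:, :1], 1e-12))
    weights /= weights.sum(axis=1, keepdims=True)
    estimate = np.sum(weights * aligned[nn], axis=1)
    return float(np.corrcoef(aligned, estimate)[0, 1])

def coupled_logistic(n, bxy=0.02, byx=0.1, x0=0.4, y0=0.2):
    """Two logistic maps with weak mutual coupling (illustrative parameters)."""
    x, y = np.empty(n), np.empty(n)
    x[0], y[0] = x0, y0
    for t in range(n - 1):
        x[t + 1] = x[t] * (3.8 - 3.8 * x[t] - bxy * y[t])
        y[t + 1] = y[t] * (3.5 - 3.5 * y[t] - byx * x[t])
    return x, y

x, y = coupled_logistic(400)
skill_x_from_y = cross_map_skill(y, x)  # evidence that x influences y
skill_y_from_x = cross_map_skill(x, y)  # evidence that y influences x
self_skill = cross_map_skill(x, x)      # sanity check: should be close to 1
print(skill_x_from_y, skill_y_from_x, self_skill)
```

In a full CCM analysis the skill is recomputed for increasing library lengths, and a causal link is inferred only when the skill converges upward as more data are used; this sketch evaluates a single library length.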


#23 Modern causal inference approaches to improve power for subgroup analysis in randomized controlled trials [PDF] [Copy] [Kimi] [REL]

Authors: Antonio D'Alessandro, Jiyu Kim, Samrachana Adhikari, Donald Goff, Falco Bargagli Stoffi, Michele Santacatterina

In randomized controlled trials (RCTs), subgroup analyses are often planned to evaluate the heterogeneity of treatment effects within pre-specified subgroups of interest. However, these analyses frequently have small sample sizes, reducing the power to detect heterogeneous effects. One way to increase power is to borrow external data from similar RCTs or observational studies. In this project, we target the conditional average treatment effect (CATE) as the estimand of interest, provide identification assumptions, and propose a doubly robust estimator that uses machine learning and Bayesian nonparametric techniques. Borrowing data, however, may present the additional challenge of practical violations of the positivity assumption: the conditional probability of receiving treatment in the external data source may be small, leading to large inverse weights and erroneous inferences, thus negating the potential power gains from borrowing external data. To overcome this challenge, we also propose a covariate balancing approach, an automated debiased machine learning (DML) estimator, and a calibrated DML estimator. We show improved power in various simulations and offer practical recommendations for the application of the proposed methods. Finally, we apply them to evaluate the effectiveness of citalopram, a drug commonly used to treat depression, for negative symptoms in first-episode schizophrenia patients across subgroups defined by duration of untreated psychosis, using data from two RCTs and an observational study.

Subject: Methodology

Publish: 2025-05-13 20:57:16 UTC
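
As a rough illustration of the doubly robust idea mentioned in the abstract above, the sketch below computes the standard augmented inverse-probability-weighted (AIPW) pseudo-outcome and averages it within a subgroup. It is not the paper's estimator: it uses a single simulated RCT with a known treatment probability of 0.5, plain linear outcome regressions in place of machine learning or Bayesian nonparametric models, and no external data borrowing. The data-generating process and all names are assumptions.

```python
import numpy as np

def aipw_pseudo_outcomes(y, a, mu1, mu0, e):
    """AIPW score: mu1 - mu0 + a*(y - mu1)/e - (1 - a)*(y - mu0)/(1 - e)."""
    return mu1 - mu0 + a * (y - mu1) / e - (1 - a) * (y - mu0) / (1 - e)

def fit_linear(x, y):
    """Least-squares fit of y on [1, x]; returns a prediction function."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda z: coef[0] + coef[1] * z

# Simulated RCT (assumed): the treatment effect is 2 inside the subgroup
# x > 0 and 1 outside it.
rng = np.random.default_rng(1)
n = 20_000
x = rng.standard_normal(n)
in_subgroup = x > 0
a = rng.integers(0, 2, size=n)  # randomized treatment, P(a = 1) = 0.5
y = x + a * (1.0 + in_subgroup) + rng.standard_normal(n)

# Deliberately simple outcome models, one per arm.
mu1_hat = fit_linear(x[a == 1], y[a == 1])
mu0_hat = fit_linear(x[a == 0], y[a == 0])

phi = aipw_pseudo_outcomes(y, a, mu1_hat(x), mu0_hat(x), 0.5)
cate_in = float(phi[in_subgroup].mean())
cate_out = float(phi[~in_subgroup].mean())
print(f"subgroup effect = {cate_in:.3f}, complement effect = {cate_out:.3f}")
```

The outcome models here are misspecified (they ignore the subgroup interaction), yet the subgroup averages remain close to the true effects because the propensity is correct, which is the double robustness property. The division by e and 1 - e also shows why small treatment probabilities in borrowed data produce the large inverse weights the abstract warns about.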


#24 Statistical Decision Theory with Counterfactual Loss [PDF] [Copy] [Kimi] [REL]

Authors: Benedikt Koch, Kosuke Imai

Classical statistical decision theory evaluates treatment choices based solely on observed outcomes. However, by ignoring counterfactual outcomes, it cannot assess the quality of decisions relative to feasible alternatives. For example, the quality of a physician's decision may depend not only on patient survival, but also on whether a less invasive treatment could have produced a similar result. To address this limitation, we extend standard decision theory to incorporate counterfactual losses--criteria that evaluate decisions using all potential outcomes. The central challenge in this generalization is identification: because only one potential outcome is observed for each unit, the associated risk under a counterfactual loss is generally not identifiable. We show that under the assumption of strong ignorability, a counterfactual risk is identifiable if and only if the counterfactual loss function is additive in the potential outcomes. Moreover, we demonstrate that additive counterfactual losses can yield treatment recommendations that differ from those based on standard loss functions, provided that the decision problem involves more than two treatment options.

Subjects: Statistics Theory , Machine Learning , Theoretical Economics

Publish: 2025-05-13 19:00:07 UTC
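
To see why additivity matters for identification in the abstract above, consider a worked example under assumed notation (not taken from the paper): a binary treatment $A$, covariates $X$, a decision rule $D = \delta(X)$, and a loss that splits as $\ell(d, Y(0), Y(1)) = \ell_0(d, Y(0)) + \ell_1(d, Y(1))$. Under strong ignorability, each term involves only one potential outcome and can be written in terms of observed data:

$$
R(\delta)
= \sum_{k \in \{0,1\}} \mathbb{E}\big[\ell_k(\delta(X), Y(k))\big]
= \sum_{k \in \{0,1\}} \mathbb{E}\Big[\, \mathbb{E}\big[\ell_k(\delta(X), Y) \mid A = k, X\big] \Big].
$$

By contrast, a loss containing a cross term such as $Y(0)\,Y(1)$ depends on the joint distribution of the two potential outcomes, which is never observed for the same unit, so its risk cannot be recovered this way.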


#25 Bounding Neyman-Pearson Region with $f$-Divergences [PDF] [Copy] [Kimi] [REL]

Authors: Andrew Mullhaupt, Cheng Peng

The Neyman-Pearson region of a simple binary hypothesis test is the set of points whose coordinates represent the false positive rate and false negative rate of some test. The lower boundary of this region is given by the Neyman-Pearson lemma and is, up to a coordinate change, equivalent to the optimal ROC curve. We establish a novel lower bound for the boundary in terms of any $f$-divergence. Since the bound generated by hockey-stick $f$-divergences characterizes the Neyman-Pearson boundary, this bound is the best possible. In the case of KL divergence, this bound improves on Pinsker's inequality. Furthermore, we obtain a closed-form refined upper bound for the Neyman-Pearson boundary in terms of the Chernoff $\alpha$-coefficient. Finally, we present methods for constructing pairs of distributions that can approximately or exactly realize any given Neyman-Pearson boundary.

Subjects: Statistics Theory , Machine Learning , Machine Learning

Publish: 2025-05-13 18:42:10 UTC
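
For context on the baseline that the abstract above says is improved, the sketch below compares the exact Neyman-Pearson boundary for testing N(0, 1) against N(mu, 1) with the classical Pinsker-based bound: any test satisfies alpha + beta >= 1 - TV >= 1 - sqrt(KL / 2). This is the textbook inequality, not the paper's new bound, and the Gaussian pair and the value of `mu` are assumptions chosen because the boundary has a closed form.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

def np_false_negative(alpha, mu):
    """False negative rate of the most powerful level-alpha test of
    N(0, 1) against N(mu, 1) with mu > 0 (reject when x exceeds a threshold)."""
    threshold = std_normal.inv_cdf(1.0 - alpha)
    return std_normal.cdf(threshold - mu)

def pinsker_lower_bound(alpha, kl):
    """Classical bound: beta >= 1 - alpha - sqrt(KL / 2), floored at 0."""
    return max(0.0, 1.0 - alpha - sqrt(kl / 2.0))

mu = 1.0
kl = mu**2 / 2.0  # KL divergence between N(0, 1) and N(mu, 1)
alphas = [i / 20 for i in range(1, 20)]
rows = [(a, np_false_negative(a, mu), pinsker_lower_bound(a, kl)) for a in alphas]

for alpha, beta, bound in rows:
    print(f"alpha={alpha:.2f}  NP boundary beta={beta:.4f}  Pinsker bound={bound:.4f}")
```

The printed gap between the two columns is the slack that a tighter divergence-based bound could close.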