Methodology

2025-05-20 | Total: 33

#1 Model Selection for Gaussian-gated Gaussian Mixture of Experts Using Dendrograms of Mixing Measures

Authors: Tuan Thai, TrungTin Nguyen, Dat Do, Nhat Ho, Christopher Drovandi

Mixture of Experts (MoE) models constitute a widely utilized class of ensemble learning approaches in statistics and machine learning, known for their flexibility and computational efficiency. They have become integral components in numerous state-of-the-art deep neural network architectures, particularly for analyzing heterogeneous data across diverse domains. Despite their practical success, the theoretical understanding of model selection, especially concerning the optimal number of mixture components or experts, remains limited and poses significant challenges. These challenges primarily stem from the inclusion of covariates in both the Gaussian gating functions and expert networks, which introduces intrinsic interactions governed by partial differential equations with respect to their parameters. In this paper, we revisit the concept of dendrograms of mixing measures and introduce a novel extension to Gaussian-gated Gaussian MoE models that enables consistent estimation of the true number of mixture components and achieves the pointwise optimal convergence rate for parameter estimation in overfitted scenarios. Notably, this approach circumvents the need to train and compare a range of models with varying numbers of components, thereby alleviating the computational burden, particularly in high-dimensional or deep neural network settings. Experimental results on synthetic data demonstrate the effectiveness of the proposed method in accurately recovering the number of experts. It outperforms common criteria such as the Akaike information criterion, the Bayesian information criterion, and the integrated completed likelihood, while achieving optimal convergence rates for parameter estimation and accurately approximating the regression function.

Subjects: Machine Learning, Machine Learning, Statistics Theory, Computation, Methodology

Publish: 2025-05-19 12:41:19 UTC
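
To make the model class in the abstract above concrete, here is a minimal one-dimensional sketch of a Gaussian-gated Gaussian MoE conditional density, together with the AIC and BIC criteria the paper compares against. It is not the authors' dendrogram method; the parameter names (`pi`, `c`, `gamma`, `a`, `b`, `sigma2`) and the example values in `EXPERTS` are assumptions chosen for illustration.

```python
import math


def normal_pdf(v, mean, var):
    """Density of N(mean, var) evaluated at v."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gating_weights(x, experts):
    """Gaussian gate: weight_k(x) is proportional to pi_k * N(x; c_k, gamma_k)."""
    raw = [e["pi"] * normal_pdf(x, e["c"], e["gamma"]) for e in experts]
    total = sum(raw)  # can underflow to 0 if x is far from every gate centre
    return [r / total for r in raw]


def conditional_density(y, x, experts):
    """p(y | x) = sum_k weight_k(x) * N(y; a_k * x + b_k, sigma2_k)."""
    weights = gating_weights(x, experts)
    return sum(
        w * normal_pdf(y, e["a"] * x + e["b"], e["sigma2"])
        for w, e in zip(weights, experts)
    )


def aic(log_likelihood, n_params):
    """Akaike information criterion (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Hypothetical two-expert model, used only to exercise the functions.
EXPERTS = [
    {"pi": 0.4, "c": -1.0, "gamma": 1.0, "a": 2.0, "b": 0.0, "sigma2": 0.5},
    {"pi": 0.6, "c": 1.5, "gamma": 0.8, "a": -1.0, "b": 1.0, "sigma2": 0.3},
]
```

Selecting the number of experts with AIC or BIC requires fitting one such model per candidate count and comparing the criteria, which is the repeated-fitting cost the abstract says the dendrogram approach avoids.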


#2 CLT and Edgeworth Expansion for m-out-of-n Bootstrap Estimators of The Studentized Median

Authors: Imon Banerjee, Sayak Chakrabarty

The m-out-of-n bootstrap, originally proposed by Bickel, Gotze, and Zwet (1992), approximates the distribution of a statistic by repeatedly drawing subsamples of size m (with m much smaller than n) without replacement from an original sample of size n. It is now routinely used for robust inference with heavy-tailed data, bandwidth selection, and other large-sample applications. Despite its broad applicability across econometrics, biostatistics, and machine learning, rigorous parameter-free guarantees for the soundness of the m-out-of-n bootstrap when estimating sample quantiles have remained elusive. This paper establishes such guarantees by analyzing the estimator of sample quantiles obtained from m-out-of-n resampling of a dataset of size n. We first prove a central limit theorem for a fully data-driven version of the estimator that holds under a mild moment condition and involves no unknown nuisance parameters. We then show that the moment assumption is essentially tight by constructing a counter-example in which the CLT fails. Strengthening the assumptions slightly, we derive an Edgeworth expansion that provides exact convergence rates and, as a corollary, a Berry-Esseen bound on the bootstrap approximation error. Finally, we illustrate the scope of our results by deriving parameter-free asymptotic distributions for practical statistics, including the quantiles for random walk Metropolis-Hastings and the rewards of ergodic Markov decision processes, thereby demonstrating the usefulness of our theory in modern estimation and learning tasks.

Subjects: Machine Learning, Artificial Intelligence, Computational Engineering, Finance, and Science, Statistics Theory, Methodology, Machine Learning

Publish: 2025-05-16 22:14:49 UTC
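
The resampling scheme described in the abstract above can be sketched as follows. This is a simplified, unstudentized illustration (the paper studies the Studentized median): it draws subsamples of size m without replacement and records the root sqrt(m) * (subsample median - full-sample median). The function name, the seed handling, and the Cauchy demo data are assumptions for illustration only.

```python
import math
import random
import statistics


def m_out_of_n_roots(sample, m, n_resamples, seed=0):
    """Return sqrt(m) * (subsample median - full-sample median) for each draw."""
    n = len(sample)
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= n")
    rng = random.Random(seed)
    full_median = statistics.median(sample)
    roots = []
    for _ in range(n_resamples):
        # random.sample picks m distinct positions, i.e. without replacement.
        subsample = rng.sample(sample, m)
        roots.append(math.sqrt(m) * (statistics.median(subsample) - full_median))
    return roots


# Hypothetical heavy-tailed demo data: n = 2000 standard Cauchy draws.
_rng = random.Random(1)
DEMO_SAMPLE = [math.tan(math.pi * (_rng.random() - 0.5)) for _ in range(2000)]
DEMO_ROOTS = m_out_of_n_roots(DEMO_SAMPLE, m=200, n_resamples=500)
```

The empirical distribution of `DEMO_ROOTS` serves as the approximation to the sampling distribution of sqrt(n) * (sample median - true median); the choice m = 200 here is arbitrary.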


#3 (Visualizing) Plausible Treatment Effect Paths

Authors: Simon Freyaldenhoven, Christian Hansen

We consider point estimation and inference for the treatment effect path of a policy. Examples include dynamic treatment effects in microeconomics, impulse response functions in macroeconomics, and event study paths in finance. We present two sets of plausible bounds to quantify and visualize the uncertainty associated with this object. Both plausible bounds are often substantially tighter than traditional confidence intervals, and can provide useful insights even when traditional (uniform) confidence bands appear uninformative. Our bounds can also lead to markedly different conclusions when there is significant correlation in the estimates, reflecting the fact that traditional confidence bands can be ineffective at visualizing the impact of such correlation. Our first set of bounds covers the average (or overall) effect rather than the entire treatment path. Our second set of bounds imposes data-driven smoothness restrictions on the treatment path. Post-selection Inference (Berk et al. [2013]) provides formal coverage guarantees for these bounds. The chosen restrictions also imply novel point estimates that perform well across our simulations.

Subjects: Econometrics, Methodology

Publish: 2025-05-17 14:09:39 UTC


#4 Proximal optimal transport divergences

Authors: Ricardo Baptista, Panagiota Birmpa, Markos A. Katsoulakis, Luc Rey-Bellet, Benjamin J. Zhang

We introduce proximal optimal transport divergence, a novel discrepancy measure that interpolates between information divergences and optimal transport distances via an infimal convolution formulation. This divergence provides a principled foundation for optimal transport proximals and proximal optimization methods frequently used in generative modeling. We explore its mathematical properties, including smoothness, boundedness, and computational tractability, and establish connections to primal-dual formulations and adversarial learning. Building on the Benamou-Brenier dynamic formulation of optimal transport cost, we also establish a dynamic formulation for proximal OT divergences. The resulting dynamic formulation is a first-order mean-field game whose optimality conditions are governed by a pair of nonlinear partial differential equations: a backward Hamilton-Jacobi equation and a forward continuity equation. Our framework generalizes existing approaches while offering new insights and computational tools for generative modeling, distributional optimization, and gradient-based learning in probability spaces.

Subjects: Optimization and Control, Probability, Methodology, Machine Learning

Publish: 2025-05-17 17:48:11 UTC


#5 Metric Graph Kernels via the Tropical Torelli Map

Authors: Yueqi Cao, Anthea Monod

We propose new graph kernels grounded in the study of metric graphs via tropical algebraic geometry. In contrast to conventional graph kernels that are based on graph combinatorics such as nodes, edges, and subgraphs, our graph kernels are purely based on the geometry and topology of the underlying metric space. A key characterizing property of our construction is its invariance under edge subdivision, making the kernels intrinsically well-suited for comparing graphs that represent different underlying spaces. We develop efficient algorithms for computing these kernels and analyze their complexity, showing that it depends primarily on the genus of the input graphs. Empirically, our kernels outperform existing methods in label-free settings, as demonstrated on both synthetic and real-world benchmark datasets. We further highlight their practical utility through an urban road network classification task.

Subjects: Machine Learning, Methodology, Machine Learning

Publish: 2025-05-17 20:00:50 UTC


#6 Stereographic Multi-Try Metropolis Algorithms for Heavy-tailed Sampling

Authors: Zhihao Wang, Jun Yang

Markov chain Monte Carlo (MCMC) methods for sampling from heavy-tailed distributions present unique challenges, particularly in high dimensions. Multi-proposal MCMC algorithms have recently gained attention for their potential to improve performance, especially through parallel implementation on modern hardware. This paper introduces a novel family of gradient-free MCMC algorithms that combine multi-try Metropolis (MTM) with the stereographic MCMC framework, specifically designed for efficient sampling from heavy-tailed targets. The proposed stereographic multi-try Metropolis (SMTM) algorithm not only outperforms traditional Euclidean MTM and existing stereographic random-walk Metropolis methods, but also avoids the pathological convergence behavior often observed in MTM and demonstrates strong robustness to tuning. These properties are supported by scaling analysis and extensive simulation studies.

Subjects: Computation, Methodology, Machine Learning

Publish: 2025-05-18 16:21:23 UTC
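
The geometric ingredient behind the stereographic framework mentioned in the abstract above is the stereographic projection between R^d and the unit sphere in R^(d+1). The sketch below shows only that standard map and its inverse, assuming a scale parameter `radius`; it is not the SMTM algorithm itself, and the function names are hypothetical.

```python
import math


def to_sphere(x, radius=1.0):
    """Map a point x in R^d to the unit sphere in R^(d+1)."""
    sq_norm = sum(v * v for v in x)
    denom = sq_norm + radius ** 2
    # The last coordinate approaches 1 (the north pole) as |x| grows.
    return [2.0 * radius * v / denom for v in x] + [(sq_norm - radius ** 2) / denom]


def from_sphere(z, radius=1.0):
    """Inverse map; undefined at the north pole, where z[-1] == 1."""
    scale = radius / (1.0 - z[-1])
    return [scale * v for v in z[:-1]]


# Example: a far-out point in R^2 lands close to the north pole.
FAR_POINT = [1000.0, -2000.0]
ON_SPHERE = to_sphere(FAR_POINT)
```

Because the whole of R^d is compressed onto a bounded surface, the far tails of a heavy-tailed target correspond to a small neighbourhood of the pole, so a bounded move on the sphere can cover a very large Euclidean distance.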


#7 Causality-Inspired Robustness for Nonlinear Models via Representation Learning

Authors: Marin Šola, Peter Bühlmann, Xinwei Shen

Distributional robustness is a central goal of prediction algorithms due to the prevalent distribution shifts in real-world data. The prediction model aims to minimize the worst-case risk among a class of distributions, also known as an uncertainty set. Causality provides a modeling framework with a rigorous robustness guarantee in the above sense, where the uncertainty set is data-driven rather than pre-specified as in traditional distributionally robust optimization. However, current causality-inspired robustness methods possess finite-radius robustness guarantees only in linear settings, where the causal relationships among the covariates and the response are linear. In this work, we propose a nonlinear method under a causal framework by incorporating recent developments in identifiable representation learning and establish a distributional robustness guarantee. To the best of our knowledge, this is the first causality-inspired robustness method with such a finite-radius robustness guarantee in nonlinear settings. Empirical validation of the theoretical findings is conducted on both synthetic data and real-world single-cell data, also illustrating that finite-radius robustness is crucial.

Subjects: Machine Learning, Machine Learning, Methodology

Publish: 2025-05-19 08:52:15 UTC


#8 SIMBA -- A Bayesian Decision Framework for the Identification of Optimal Biomarker Subgroups for Cancer Basket Clinical Trials

Authors: Shijie Yuan, Jiaxin Liu, Zhihua Gong, Xia Qin, Crystal Qin, Yuan Ji, Peter Müller

We consider basket trials in which a biomarker-targeting drug may be efficacious for patients across different disease indications. Patients are enrolled if their cells exhibit some level of biomarker expression. The threshold level is allowed to vary by indication. The proposed SIMBA method uses a decision framework to identify optimal biomarker subgroups (OBS) defined by an optimal biomarker threshold for each indication. The optimality is achieved through minimizing a posterior expected loss that balances estimation accuracy and investigator preference for broadly effective therapeutics. A Bayesian hierarchical model is proposed to adaptively borrow information across indications and enhance the accuracy in the estimation of the OBS. The operating characteristics of SIMBA are assessed via simulations and compared against a simplified version and an existing alternative method, neither of which borrows information. SIMBA is expected to improve the identification of patient sub-populations that may benefit from biomarker-driven therapeutics.

Subjects: Applications, Methodology

Publish: 2025-05-19 14:54:28 UTC


#9 From What Ifs to Insights: Counterfactuals in Causal Inference vs. Explainable AI

Authors: Galit Shmueli, David Martens, Jaewon Yoo, Travis Greene

Counterfactuals play a pivotal role in the two distinct data science fields of causal inference (CI) and explainable artificial intelligence (XAI). While the core idea behind counterfactuals remains the same in both fields--the examination of what would have happened under different circumstances--there are key differences in how they are used and interpreted. We introduce a formal definition that encompasses the multi-faceted concept of the counterfactual in CI and XAI. We then discuss how counterfactuals are used, evaluated, generated, and operationalized in CI vs. XAI, highlighting conceptual and practical differences. By comparing and contrasting the two, we hope to identify opportunities for cross-fertilization across CI and XAI.

Subjects: Machine Learning, Artificial Intelligence, Machine Learning, Econometrics, Methodology

Publish: 2025-05-19 16:34:36 UTC


#10 Discretion in the Loop: Human Expertise in Algorithm-Assisted College Advising

Authors: Sofiia Druchyna, Kara Schechtman, Benjamin Brandon, Jenise Stafford, Hannah Li, Lydia T. Liu

In higher education, many institutions use algorithmic alerts to flag at-risk students and deliver advising at scale. While much research has focused on evaluating algorithmic predictions, relatively little is known about how discretionary interventions by human experts shape outcomes in algorithm-assisted settings. We study this question using rich quantitative and qualitative data from a randomized controlled trial of an algorithm-assisted advising program at Georgia State University. Taking a mixed-methods approach, we examine whether and how advisors use context unavailable to an algorithm to guide interventions and influence student success. We develop a causal graphical framework for human expertise in the interventional setting, extending prior work on discretion in purely predictive settings. We then test a necessary condition for discretionary expertise using structured advisor logs and student outcomes data, identifying several interventions that meet the criterion for statistical significance. Accordingly, we estimate that 2 out of 3 interventions taken by advisors in the treatment arm were plausibly "expertly targeted" to students using non-algorithmic context. Systematic qualitative analysis of advisor notes corroborates these findings, showing that advisors incorporate diverse forms of contextual information--such as personal circumstances, financial issues, and student engagement--into their decisions. Finally, we explore the broader implications of human discretion for long-term outcomes and equity, using heterogeneous treatment effect estimation. Our results offer theoretical and practical insight into the real-world effectiveness of algorithm-supported college advising, and underscore the importance of accounting for human expertise in the design, evaluation, and implementation of algorithmic decision systems.

Subjects: Computers and Society, Applications, Methodology, Machine Learning

Publish: 2025-05-19 16:34:40 UTC


#11 Theory: Multidimensional Space of Events

Author: Sergii Kavun

This paper extends Bayesian probability theory by developing a multidimensional space of events (MDSE) theory that accounts for mutual influences between events and hypotheses sets. While traditional Bayesian approaches assume conditional independence between certain variables, real-world systems often exhibit complex interdependencies that limit classical model applicability. Building on established probabilistic foundations, our approach introduces a mathematical formalism for modeling these complex relationships. We developed the MDSE theory through rigorous mathematical derivation and validated it using three complementary methodologies: analytical proofs, computational simulations, and case studies drawn from diverse domains. Results demonstrate that MDSE successfully models complex dependencies with 15-20% improved prediction accuracy compared to standard Bayesian methods when applied to datasets with high interdimensionality. This theory particularly excels in scenarios with over 50 interrelated variables, where traditional methods show exponential computational complexity growth while MDSE maintains polynomial scaling. Our findings indicate that MDSE provides a viable mathematical foundation for extending Bayesian reasoning to complex systems while maintaining computational tractability. This approach offers practical applications in engineering challenges including risk assessment, resource optimization, and forecasting problems where multiple interdependent factors must be simultaneously considered.

Subjects: Methodology, Logic, Probability, Machine Learning

Publish: 2025-05-16 08:54:12 UTC


#12 Covariate-moderated Empirical Bayes Matrix Factorization

Authors: William R. P. Denault, Karl Tayeb, Peter Carbonetto, Jason Willwerscheid, Matthew Stephens

Matrix factorization is a fundamental method in statistics and machine learning for inferring and summarizing structure in multivariate data. Modern data sets often come with "side information" of various forms (images, text, graphs) that can be leveraged to improve estimation of the underlying structure. However, existing methods that leverage side information are limited in the types of data they can incorporate, and they assume specific parametric models. Here, we introduce a novel method for this problem, covariate-moderated empirical Bayes matrix factorization (cEBMF). cEBMF is a modular framework that accepts any type of side information that is processable by a probabilistic model or neural network. The cEBMF framework can accommodate different assumptions and constraints on the factors through the use of different priors, and it adapts these priors to the data. We demonstrate the benefits of cEBMF in simulations and in analyses of spatial transcriptomics and MovieLens data.

Subject: Methodology

Publish: 2025-05-16 19:05:11 UTC


#13 Model-Based Clustering with Sequential Outlier Identification using the Distribution of Mahalanobis Distances

Authors: Ultán P. Doherty, Paul D. McNicholas, Arthur White

The presence of outliers can prevent clustering algorithms from accurately determining an appropriate group structure within a data set. We present outlierMBC, a model-based approach for sequentially removing outliers and clustering the remaining observations. Our method identifies outliers one at a time while fitting a multivariate Gaussian mixture model to data. Since it can be difficult to classify observations as outliers without knowing what the correct cluster structure is a priori, and the presence of outliers interferes with the process of modelling clusters correctly, we use an iterative method to identify outliers one by one. At each iteration, outlierMBC removes the observation with the lowest density and fits a Gaussian mixture model to the remaining data. The method continues to remove potential outliers until a pre-set maximum number of outliers is reached, then retrospectively identifies the optimal number of outliers. To decide how many outliers to remove, it uses the fact that the squared sample Mahalanobis distances of Gaussian distributed observations are Beta distributed when scaled appropriately. outlierMBC chooses the number of outliers which minimises a dissimilarity between this theoretical Beta distribution and the observed distribution of the scaled squared sample Mahalanobis distances. This means that our method both clusters the data using a Gaussian mixture model and implements a model-based procedure to identify the optimal outliers to remove without requiring the number of outliers to be pre-specified. Unlike leading methods in the literature, outlierMBC does not assume that the outliers follow a known distribution or that the number of outliers can be pre-specified. Moreover, outlierMBC performs strongly compared to these algorithms when applied to a range of simulated and real data sets.

Subjects: Methodology, Computation

Publish: 2025-05-16 20:04:14 UTC
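
The distributional fact the abstract above relies on can be checked numerically. For n Gaussian observations in p dimensions, with the sample mean and the unbiased sample covariance, the scaled squared Mahalanobis distance n * d^2 / (n - 1)^2 follows a Beta(p/2, (n - p - 1)/2) distribution, whose mean is p / (n - 1). The sketch below covers a single Gaussian component with p = 2 only; it is not the outlierMBC implementation, and the function name and demo data are assumptions.

```python
import random


def scaled_sq_mahalanobis_2d(points):
    """Return n * d_i^2 / (n - 1)^2 for 2-D points, using sample mean and covariance."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Unbiased sample covariance entries (divisor n - 1).
    s_xx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    s_yy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    s_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    det = s_xx * s_yy - s_xy ** 2
    scaled = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # Quadratic form with the closed-form inverse of the 2x2 covariance.
        d2 = (s_yy * dx * dx - 2.0 * s_xy * dx * dy + s_xx * dy * dy) / det
        scaled.append(n * d2 / (n - 1) ** 2)
    return scaled


# Hypothetical demo data: 50 independent standard bivariate Gaussian draws.
_rng = random.Random(0)
DEMO_POINTS = [(_rng.gauss(0.0, 1.0), _rng.gauss(0.0, 1.0)) for _ in range(50)]
DEMO_SCALED = scaled_sq_mahalanobis_2d(DEMO_POINTS)
```

Every scaled value lies in [0, 1], and their average equals p / (n - 1) exactly, matching the Beta mean. An outlier-removal procedure in the spirit of the abstract would compare the empirical distribution of these values against the Beta reference after each removal.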


#14 BLOG: Bayesian Longitudinal Omics with Group Constraints

Authors: Livia Popa, Sumanta Basu, Myung Hee Lee, Martin T. Wells

Clinical investigators are increasingly interested in discovering computational biomarkers from short-term longitudinal omics data sets. This work focuses on Bayesian regression and variable selection for longitudinal omics datasets, which can quantify uncertainty and control false discovery. In our univariate approach, Zellner's $g$ prior is used with two different options for the tuning parameter $g$: $g=\sqrt{n}$ and a $g$ that minimizes Stein's unbiased risk estimate (SURE). Bayes factors are used to quantify uncertainty and control for false discovery. In the multivariate approach, we use Bayesian Group LASSO with a spike and slab prior for group variable selection. In both approaches, we use the first difference ($\Delta$) scale of the longitudinal predictors and the response. These methods work together to enhance our understanding of biomarker identification, improving inference and prediction. We compare our method against commonly used linear mixed-effects models on simulated data and real data from a tuberculosis (TB) study on metabolite biomarker selection. With an automated selection of hyperparameters, the Zellner's $g$ prior approach correctly identifies target metabolites with high specificity and sensitivity across various simulation and real data scenarios. The multivariate Bayesian Group LASSO spike and slab approach also correctly selects target metabolites across various simulation scenarios.

Subjects: Methodology, Applications

Publish: 2025-05-16 20:17:16 UTC
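
As a rough illustration of the univariate approach in the abstract above, the sketch below takes first differences of a series and computes the posterior mean of a single regression coefficient under Zellner's $g$ prior with $g=\sqrt{n}$. It assumes a no-intercept model and a zero prior mean, in which case the posterior mean is the least-squares estimate shrunk by g / (1 + g). The function names are hypothetical, and the SURE-based choice of $g$ and the Bayes factors are not shown.

```python
import math


def first_differences(series):
    """Delta scale: x_t - x_{t-1} for consecutive time points of one subject."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def g_prior_posterior_mean(x, y, g=None):
    """Posterior mean of beta in y = x * beta + noise under Zellner's g prior.

    Assumes no intercept and prior mean zero, so the result is
    g / (1 + g) times the ordinary least-squares estimate.
    """
    n = len(x)
    if n != len(y) or n == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    if g is None:
        g = math.sqrt(n)  # the g = sqrt(n) option named in the abstract
    beta_ols = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    return g / (1.0 + g) * beta_ols
```

With x = [1, 2, 3, 4] and y = [2, 4, 6, 8], the least-squares slope is 2 and g = 2, so the posterior mean is 4/3: the prior pulls the estimate toward zero, and the pull weakens as n grows.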


#15 Robust outlier detection for heterogeneous distributions applicable to censoring in functional MRI

Authors: Saranjeet Singh Saluja, Fatma Parlak, Damon Pham, Amanda Mejia

Functional magnetic resonance imaging (fMRI) data are prone to intense "burst" noise artifacts due to head movements and other sources. Such volumes can be considered as high-dimensional outliers that can be identified using statistical outlier detection techniques, which allows for controlling the false positive rate. Previous work has used dimension reduction and multivariate outlier detection techniques, including the use of robust minimum covariance determinant (MCD) distances. Under Gaussianity, the distribution of these robust distances can be approximated, and an upper quantile of that distribution can be used to identify outlying volumes. Unfortunately, the Gaussian assumption is unrealistic for fMRI data in this context. One way to address this is to transform the data to Normality. A limitation of existing robust methods for this purpose, such as robust Box-Cox and Yeo-Johnson transformations, is that they can deal with skew but not heavy or light tails. Here, we develop a novel robust method for transformation to central Normality based on the highly flexible sinh-arcsinh (SHASH) family of distributions. To avoid the influence of outliers, it is crucial to initialize the outlier labels with a high degree of sensitivity. For this purpose, we consider a commonplace robust z-score approach, and a modified isolation forest (iForest) approach, a popular technique for anomaly detection in machine learning. Through extensive simulation studies, we find that our proposed SHASH transformation initialized using iForest clearly outperforms benchmark methods in a variety of settings, including skewed and heavy-tailed distributions, and light to heavy outlier contamination. We also apply the proposed techniques to several example datasets and find this combination to have consistently strong performance.

Subject: Methodology

Publish: 2025-05-17 03:24:44 UTC
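
The "commonplace robust z-score approach" mentioned in the abstract above is usually taken to mean centering by the median and scaling by the median absolute deviation (MAD). The sketch below follows that common convention, with the factor 1.4826 that makes the MAD consistent for the standard deviation under Gaussian data. The threshold of 3 and the function names are assumptions, not values taken from the paper.

```python
import statistics

MAD_TO_SD = 1.4826  # consistency factor for Gaussian data


def robust_z_scores(values):
    """(value - median) / (1.4826 * MAD) for each value."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; robust z-scores are undefined")
    return [(v - med) / (MAD_TO_SD * mad) for v in values]


def flag_outliers(values, threshold=3.0):
    """Indices whose absolute robust z-score exceeds the (assumed) threshold."""
    return [i for i, z in enumerate(robust_z_scores(values)) if abs(z) > threshold]


# Hypothetical series with one burst-like spike at the last position.
DEMO_SERIES = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 25.0]
DEMO_FLAGS = flag_outliers(DEMO_SERIES)
```

Because the median and MAD are barely moved by a single spike, the spike stands out sharply, whereas a mean and standard deviation computed from the same series would be inflated by it.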


#16 Model-X Change-Point Detection of Conditional Distribution

Authors: Yiwen Huang, Yan Dong, Mengying Yan, Ziye Tian, Chuan Hong, Doudou Zhou, Molei Liu

The dynamic nature of many real-world systems can lead to temporal outcome model shifts, causing a deterioration in model accuracy and reliability over time. This requires change-point detection on the outcome models to guide model retraining and adjustments. However, inferring the change point of conditional models is more prone to loss of validity or power than classic detection problems for marginal distributions. This is due to both the temporal covariate shift and the complexity of the outcome model. To address these challenges, we propose a novel model-X Conditional Random Testing (CRT) method computationally enhanced with latent mixture model (LMM) distillation for simultaneous change-point detection and localization of the conditional outcome model. Built upon the model-X framework, our approach can effectively adjust for the potential bias caused by the temporal covariate shift and allow the flexible use of general machine learning methods for outcome modeling. It preserves good validity against complex or erroneous outcome models, even with imperfect knowledge of the temporal covariate shift learned from some auxiliary unlabeled data. Moreover, the incorporation of LMM distillation significantly reduces the computational burden of the CRT by eliminating the need for repeated complex model refitting in its resampling procedure, while preserving statistical validity and power. Theoretical validity of the proposed method is justified. Extensive simulation studies and a real-world example demonstrate the statistical effectiveness and computational scalability of our method as well as its significant improvements over existing methods.

Subject: Methodology

Publish: 2025-05-17 14:27:21 UTC


#17 Cyclic-Shift Sparse Kronecker Tensor Classifier for Signal-Region Detection in Neuroimaging

Authors: Hsin-Hsiung Huang, Yuh-Haur Chen, Teng Zhang

This study proposes a cyclic-shift logistic sparse Kronecker product decomposition (SKPD) model for high-dimensional tensor data, enhancing the SKPD framework with a cyclic-shift mechanism for binary classification. The method enables interpretable and scalable analysis of brain MRI data, detecting disease-relevant regions through a structured low-rank factorization. By incorporating a second spatially shifted view of the data, the cyclic-shift logistic SKPD improves robustness to misalignment across subjects, a common challenge in neuroimaging. We provide asymptotic consistency guarantees under a restricted isometry condition adapted to logistic loss. Simulations confirm the model's ability to recover spatial signals under noise and identify optimal patch sizes for factor decomposition. Application to OASIS-1 and ADNI-1 datasets demonstrates that the model achieves strong classification accuracy and localizes estimated coefficients in clinically relevant brain regions, such as the hippocampus. A data-driven slice selection strategy further improves interpretability in 2D projections. The proposed framework offers a principled, interpretable, and computationally efficient tool for neuroimaging-based disease diagnosis, with potential extensions to multi-class settings and more complex transformations.

Subjects: Methodology, Computation

Publish: 2025-05-17 18:43:04 UTC


#18 Counterfactual Q Learning via the Linear Buckley James Method for Longitudinal Survival Data

Authors: Jeongjin Lee, Jong-Min Kim

Treatment strategies are critical in healthcare, particularly when outcomes are subject to censoring. This study introduces the Counterfactual Buckley-James Q-Learning framework, which integrates the Buckley-James method with reinforcement learning to address challenges posed by censored survival data. The Buckley-James method imputes censored survival times via conditional expectations based on observed data, offering a robust mechanism for handling incomplete outcomes. By incorporating these imputed values into a counterfactual Q-learning framework, the proposed method enables the estimation and comparison of potential outcomes under different treatment strategies. This facilitates the identification of optimal dynamic treatment regimes that maximize expected survival time. Through extensive simulation studies, the method demonstrates robust performance across various sample sizes and censoring scenarios, including right censoring and missing at random (MAR). Application to real-world clinical trial data further highlights the utility of this approach in informing personalized treatment decisions, providing an interpretable and reliable tool for optimizing survival outcomes in complex clinical settings.

Subjects: Methodology, Computation

Publish: 2025-05-17 22:36:38 UTC


#19 Reliable fairness auditing with semi-supervised inference

Authors: Jianhui Gao, Jessica Gronsbell

Machine learning (ML) models often exhibit bias that can exacerbate inequities in biomedical applications. Fairness auditing, the process of evaluating a model's performance across subpopulations, is critical for identifying and mitigating these biases. However, such audits typically rely on large volumes of labeled data, which are costly and labor-intensive to obtain. To address this challenge, we introduce $\textit{Infairness}$, a unified framework for auditing a wide range of fairness criteria using semi-supervised inference. Our approach combines a small labeled dataset with a large unlabeled dataset by imputing missing outcomes via regression with carefully selected nonlinear basis functions. We show that our proposed estimator is (i) consistent regardless of whether the ML or imputation models are correctly specified and (ii) more efficient than standard supervised estimation with the labeled data when the imputation model is correctly specified. Through extensive simulations, we also demonstrate that Infairness consistently achieves higher precision than supervised estimation. In a real-world application of phenotyping depression from electronic health records data, Infairness reduces variance by up to 64% compared to supervised estimation, underscoring its value for reliable fairness auditing with limited labeled data.

Subject: Methodology

Publish: 2025-05-18 00:42:21 UTC


#20 Estimation of Treatment Harm Rate via Partitioning

Authors: Wei Liang, Changbao Wu

In causal inference with binary outcomes, there is a growing interest in estimation of the treatment harm rate (THR), which is a measure of treatment risk and reveals treatment effect heterogeneity in a subpopulation. The THR is generally non-identifiable even for randomized controlled trials (RCTs), and existing works focus primarily on the estimation of the THR under either untestable identification or ambiguous model assumptions. We develop a class of partitioning-based bounds for the THR based on data from RCTs with two distinct features: our proposed bounds effectively use available auxiliary covariate information, and the bounds can be consistently estimated without relying on any untestable or ambiguous model assumptions. The finite-sample performance of our proposed interval estimators, along with a conservatively extended confidence interval for the THR, is evaluated through Monte Carlo simulation studies. An application of the proposed methods to the ACTG 175 data is presented. A Python package named partbte for the partitioning-based algorithm has been developed and is available at https://github.com/w62liang/partition-te.

Subject: Methodology

Publish: 2025-05-18 02:54:56 UTC
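
To illustrate why partitioning on covariates can tighten bounds on the THR discussed in the abstract above, the sketch below applies the classical Fréchet-Hoeffding bounds within each stratum and averages them by stratum weight. It assumes that outcome 1 is the favourable one, so that harm means Y(1) = 0 and Y(0) = 1, and it works with known response probabilities rather than estimates. It is a generic textbook construction, not the partbte package or necessarily the paper's exact bounds; all names are hypothetical.

```python
def frechet_bounds(p_treated, p_control):
    """Bounds on P(Y(1) = 0, Y(0) = 1) given P(Y(1) = 1) and P(Y(0) = 1)."""
    lower = max(0.0, p_control - p_treated)
    upper = min(p_control, 1.0 - p_treated)
    return lower, upper


def partitioned_bounds(strata):
    """Weight-average the within-stratum bounds.

    strata: list of (weight, p_treated, p_control), with weights summing to 1.
    """
    lower = sum(w * frechet_bounds(p1, p0)[0] for w, p1, p0 in strata)
    upper = sum(w * frechet_bounds(p1, p0)[1] for w, p1, p0 in strata)
    return lower, upper


# Hypothetical two-stratum population with opposite treatment effects.
DEMO_STRATA = [(0.5, 0.8, 0.3), (0.5, 0.2, 0.6)]
MARGINAL_P1 = sum(w * p1 for w, p1, _ in DEMO_STRATA)
MARGINAL_P0 = sum(w * p0 for w, _, p0 in DEMO_STRATA)
UNPARTITIONED = frechet_bounds(MARGINAL_P1, MARGINAL_P0)
PARTITIONED = partitioned_bounds(DEMO_STRATA)
```

In this example the marginal bounds are roughly [0, 0.45], while the stratified bounds are roughly [0.2, 0.4]. The stratified interval is never wider, because the lower-bound function is convex and the upper-bound function is concave in the response probabilities.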


#21 A Hybrid Prior Bayesian Method for Combining Domestic Real-World Data and Overseas Data in Global Drug Development

Authors: Keer Chen, Zengyue Zheng, Pengfei Zhu, Shuping Jiang, Nan Li, Jumin Deng, Pingyan Chen, Zhenyu Wu, Ying Wu

Background: Hybrid clinical trial design integrates randomized controlled trials (RCTs) with real-world data (RWD) to enhance efficiency through dynamic incorporation of external data. Existing methods such as the Meta-Analytic Predictive Prior (MAP) do not adequately control data heterogeneity, adjust for baseline discrepancies, or optimize dynamic borrowing proportions, which introduces bias and limits applications in bridging trials and multi-regional clinical trials (MRCTs). Objective: This study proposes a novel hybrid Bayesian framework (EQPS-rMAP) to address heterogeneity and bias in multi-source data integration, validated through simulations and retrospective case analyses of risankizumab's efficacy in moderate-to-severe plaque psoriasis. Design and Methods: EQPS-rMAP eliminates baseline covariate discrepancies via propensity score stratification, constructs stratum-specific MAP priors to dynamically adjust external data weights, and introduces equivalence probability weights to quantify data conflict risks. Performance was evaluated across six simulated scenarios (heterogeneity differences, baseline shifts) and real-world case analyses, comparing it with traditional methods (MAP, PSMAP, EBMAP) on estimation bias, type I error control, and sample size requirements. Results: Simulations show EQPS-rMAP maintains estimation robustness under significant heterogeneity while reducing sample size demands and enhancing trial efficiency. Case analyses confirm superior external bias control and accuracy compared to conventional approaches. Conclusion and Significance: EQPS-rMAP provides empirical evidence for hybrid clinical designs. By resolving baseline-heterogeneity conflicts through adaptive mechanisms, it enables reliable integration of external and real-world data in bridging trials, MRCTs, and post-marketing studies, broadening applicability without compromising rigor.

Subjects: Methodology, Statistics Theory

Publish: 2025-05-18 08:42:17 UTC


#22 Truncated Gaussian copula principal component analysis with application to pediatric acute lymphoblastic leukemia patients' gut microbiome

Authors: Lei Wang, Yang Ni, Irina Gaynanova

Increasing epidemiologic evidence suggests that the diversity and composition of the gut microbiome can predict infection risk in cancer patients. Infections remain a major cause of morbidity and mortality during chemotherapy. Analyzing microbiome data to identify associations with infection pathogenesis for proactive treatment has become a critical research focus. However, the high-dimensional nature of the data necessitates the use of dimension-reduction methods to facilitate inference and interpretation. Traditional dimension reduction methods, which assume Gaussianity, perform poorly with skewed and zero-inflated microbiome data. To address these challenges, we propose a semiparametric principal component analysis (PCA) method based on a truncated latent Gaussian copula model that accommodates both skewness and zero inflation. Simulation studies demonstrate that the proposed method outperforms existing approaches by providing more accurate estimates of scores and loadings across various copula transformation settings. We apply our method, along with competing approaches, to gut microbiome data from pediatric patients with acute lymphoblastic leukemia. The principal scores derived from the proposed method reveal the strongest associations between pre-chemotherapy microbiome composition and adverse events during subsequent chemotherapy, offering valuable insights for improving patient outcomes.

Subject: Methodology

Publish: 2025-05-18 15:59:24 UTC


#23 Modeling Nonstationary Extremal Dependence via Deep Spatial Deformations

Authors: Xuanjie Shao, Jordan Richards, Raphael Huser

Modeling nonstationarity that often prevails in extremal dependence of spatial data can be challenging, and typically requires bespoke or complex spatial models that are difficult to estimate. Inference for stationary and isotropic models is considerably easier, but the assumptions that underpin these models are rarely met by data observed over large or topographically complex domains. A possible approach for accommodating nonstationarity in a spatial model is to warp the spatial domain to a latent space where stationarity and isotropy can be reasonably assumed. Although this approach is very flexible, estimating the warping function can be computationally expensive, and the transformation is not always guaranteed to be bijective, which may lead to physically unrealistic transformations when the domain folds onto itself. We overcome these challenges by developing deep compositional spatial models to capture nonstationarity in extremal dependence. Specifically, we focus on modeling high threshold exceedances of process functionals by leveraging efficient inference methods for limiting $r$-Pareto processes. A detailed high-dimensional simulation study demonstrates the superior performance of our model in estimating the warped space. We illustrate our method by modeling UK precipitation extremes and show that we can efficiently estimate the extremal dependence structure of data observed at thousands of locations.

Subjects: Methodology, Machine Learning

Publish: 2025-05-18 21:22:00 UTC


#24 Extended semi-Latin squares for use in field and glasshouse trials

Author: E. R. Williams

Semi-Latin squares have been extensively studied. They can be interpreted as a special case of latinized block designs where the number of columns is equal to the number of replicates in the design. Latinized row-column designs are frequently used in field and glasshouse trials when replicates are contiguous. These designs allow for the efficient adjustment of row and column effects within replicates. Here we define extended semi-Latin squares as a special case of latinized row-column designs and investigate optimality using the average efficiency factor.

Subject: Methodology

Publish: 2025-05-18 23:06:03 UTC


#25 Double machine learning to estimate the effects of multiple treatments and their interactions

Authors: Qingyan Xiang, Yubai Yuan, Dongyuan Song, Usman J. Wudil, Muktar H. Aliyu, C. William Wester, Bryan E. Shepherd

The causal inference literature has focused extensively on binary treatments, with relatively few methods developed for multi-valued treatments. In particular, methods for multiple simultaneously assigned treatments remain understudied despite their practical importance. This paper introduces two settings: (1) estimating the effects of multiple treatments of different types (binary, categorical, and continuous) and the effects of treatment interactions, and (2) estimating the average treatment effect across categories of multi-valued regimens. To obtain robust estimates for both settings, we propose a class of methods based on the Double Machine Learning (DML) framework. Our methods are well-suited for complex settings of multiple treatments/regimens, using machine learning to model confounding relationships while overcoming regularization and overfitting biases through Neyman orthogonality and cross-fitting. To our knowledge, this work is the first to apply machine learning for robust estimation of interaction effects in the presence of multiple treatments. We further establish the asymptotic distribution of our estimators and derive variance estimators for statistical inference. Extensive simulations demonstrate the performance of our methods. Finally, we apply the methods to study the effect of three treatments on HIV-associated kidney disease in an adult HIV cohort of 2455 participants in Nigeria.
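The combination of cross-fitting and an orthogonal (partialling-out) score can be sketched for a partially linear model with two treatments and their interaction, $Y = \theta_1 D_1 + \theta_2 D_2 + \theta_3 D_1 D_2 + g(X) + \varepsilon$. This is a minimal sketch under stated assumptions and not the authors' estimator: the function names are hypothetical, the model form is assumed for illustration, and ordinary least squares on a quadratic basis of two covariates stands in for the machine learning nuisance learners.

```python
import numpy as np

def quad_basis(x):
    """Intercept, linear, and second-order terms of a two-column covariate matrix."""
    x0, x1 = x[:, 0], x[:, 1]
    return np.column_stack([np.ones(len(x)), x0, x1, x0**2, x0 * x1, x1**2])

def cross_fit_residuals(target, x, folds):
    """Out-of-fold residuals: target minus its prediction from x.

    The nuisance model for each fold is fit only on the other folds, so no
    observation is predicted by a model that saw it during fitting.
    """
    basis = quad_basis(x)
    resid = np.empty(len(target), dtype=float)
    for k in np.unique(folds):
        held_out = folds == k
        coef, *_ = np.linalg.lstsq(basis[~held_out], target[~held_out], rcond=None)
        resid[held_out] = target[held_out] - basis[held_out] @ coef
    return resid

def dml_interaction(y, d1, d2, x, n_folds=5, seed=0):
    """Estimate (theta_1, theta_2, theta_3) by residual-on-residual regression."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds      # random, balanced fold labels
    terms = np.column_stack([d1, d2, d1 * d2])     # main effects and interaction
    y_res = cross_fit_residuals(y, x, folds)
    d_res = np.column_stack(
        [cross_fit_residuals(terms[:, j], x, folds) for j in range(terms.shape[1])]
    )
    # Regressing residualized Y on residualized treatment terms is the
    # partialling-out score for the partially linear model.
    theta, *_ = np.linalg.lstsq(d_res, y_res, rcond=None)
    return theta
```

Treating the interaction $D_1 D_2$ as its own residualized regressor is one simple design choice. In practice the least-squares nuisance fits would be replaced by flexible learners, and inference would require a variance estimator, which this sketch omits.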

Subjects: Methodology, Applications

Publish: 2025-05-19 02:02:43 UTC