2026-04-15 | | Total: 56
Interference arises when the treatment assigned to one individual affects the outcomes of other individuals. Commonly, individuals are naturally grouped into clusters, and interference occurs only among individuals within the same cluster, a setting referred to as partial interference. We study network causal effects on outcome quantiles in the presence of partial interference. We develop a general nonparametric efficiency theory for estimating these network quantile causal effects, which leads to a nonparametrically efficient estimator. The proposed estimator is consistent and asymptotically normal with parametric convergence rates, while allowing for flexible, data-adaptive estimation of complex nuisance functions. We leverage a three-way cross-fitting procedure that avoids direct estimation of the conditional outcome distribution. Simulations demonstrate adequate finite-sample performance of the proposed estimators, and we apply the methods to a clustered observational study.
This work establishes that an optimal transport (OT) problem regularized by a given $f$-divergence admits the same solution as another OT problem regularized by a different $g$-divergence, under an appropriate transformation of the cost function. This structural equivalence between OT problems regularized by distinct divergences, in the sense of sharing the same unique minimizer, is demonstrated within the framework of Polish spaces with bounded cost functions.
Predicting counterfactual outcomes in longitudinal data, where sequential treatment decisions heavily depend on evolving patient states, is critical yet notoriously challenging due to complex time-dependent confounding and inadequate uncertainty quantification in existing methods. We introduce the Causal Diffusion Model (CDM), the first denoising diffusion probabilistic approach explicitly designed to generate full probabilistic distributions of counterfactual outcomes under sequential interventions. CDM employs a novel residual denoising architecture with relational self-attention, capturing intricate temporal dependencies and multimodal outcome trajectories without requiring explicit adjustments (e.g., inverse-probability weighting or adversarial balancing) for confounding. In rigorous evaluation on a pharmacokinetic-pharmacodynamic tumor-growth simulator widely adopted in prior work, CDM consistently outperforms state-of-the-art longitudinal causal inference methods, achieving a 15-30% relative improvement in distributional accuracy (1-Wasserstein distance) while maintaining competitive or superior point-estimate accuracy (RMSE) under high-confounding regimes. By unifying uncertainty quantification and robust counterfactual prediction in complex, sequentially confounded settings, without tailored deconfounding, CDM offers a flexible, high-impact tool for decision support in medicine, policy evaluation, and other longitudinal domains.
We define dynamic treatment regimes and associated potential outcomes for data described by marked point processes (MPPs). These definitions motivate MPP analogues of the commonly used consistency, exchangeability, and positivity conditions that are sufficient for identifying effects in MPP data structures. The conditions are formulated based on martingale theory, which allows us to derive explicit identifying assumptions for data described by stochastic processes. The definitions and conditions align with well-established discrete-time results in important special cases. Thus, this work bridges the large literatures on survival (event history) analysis with counting processes in continuous time and causal inference with variables in discrete time. After formulating a set of identification conditions, we derive and characterize marginal g-formulas. The g-formulas are generally different from those studied in related works, though they coincide in important special cases. We relate our findings to previous work on causal inference with (counting) processes, the classical survival literature, and the discrete-time causal inference literature.
Recently, the statistical properties of empirical Entropic Optimal Transport (EOT) have attracted great interest, as this quantity has been shown to be useful for complex data analysis, among other reasons due to its computational efficiency. In several applications, it has been observed that the EOT plan provides valuable information beyond just the optimal value. For example, in cell biology, colocalization analysis based on the EOT plan has been introduced as a measure for quantification of spatial proximity of different protein assemblies. Despite recent progress in the analysis of its risk properties, a precise understanding of its statistical fluctuations to make it accessible for inference remains elusive to a large extent. In this paper, we derive asymptotic weak convergence results for a large class of functionals of the EOT plan, in which the colocalization process is included. The proof is based on Hadamard differentiability and the extended delta method. As an application, we obtain uniform confidence bands for colocalization curves and bootstrap consistency. Our theory is supported by simulation studies and is illustrated by a real-world data analysis of mitochondrial protein colocalization.
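As a rough illustration of the EOT plan that the abstract above refers to, the following sketch computes an entropic transport plan between two small empirical measures with plain Sinkhorn iterations. The function name `sinkhorn_plan`, the squared-distance cost, the regularization value `reg=0.5`, and the toy point clouds are all assumptions made for illustration; they are not taken from the paper, and the colocalization functional itself is not implemented here.

```python
import numpy as np

def sinkhorn_plan(a, b, C, reg=0.5, n_iter=500):
    """Entropic OT plan between weights a (n,) and b (m,) for cost matrix C (n, m).

    Hypothetical helper: standard Sinkhorn scaling iterations, not the paper's code.
    """
    K = np.exp(-C / reg)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)             # rescale to match column marginals
        u = a / (K @ v)               # rescale to match row marginals
    return u[:, None] * K * v[None, :]

# Toy example (assumed data): two uniform empirical measures on [0, 1].
x = np.linspace(0.0, 1.0, 5)
y = np.linspace(0.0, 1.0, 4)
a = np.full(5, 1 / 5)
b = np.full(4, 1 / 4)
C = (x[:, None] - y[None, :]) ** 2    # squared-distance cost
P = sinkhorn_plan(a, b, C)
```

The returned matrix `P` is strictly positive and has (approximately) the prescribed marginals `a` and `b`; functionals of the plan, such as the colocalization process studied in the abstract, would be computed from `P`.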
Both cluster randomized trials and quasi-experimental designs are used to evaluate the impact of health and social policies and interventions. Stepped-wedge cluster randomized trials randomize a staggered adoption approach, while recent difference-in-differences methods allow analysis of non-randomized settings where similar policies are adopted at different time points. These approaches have become common, but the sheer variety of methods for analyzing observational studies with staggered adoption makes it challenging to clearly design and report such studies. We propose that observational and quasi-experimental study investigators can address these challenges by emulating stepped-wedge cluster randomized trials in the target trial emulation framework. The conceptual framework and reporting standards of trial emulation will encourage consideration of key features of these designs, such as policy heterogeneity and time-varying effects, and clear reporting of the estimand and assumptions. It also highlights areas where those interested in randomized trials and quasi-experimental designs can benefit from one another's experience by bringing insights across disciplines. Questions of treatment effect heterogeneity, power, spillovers, and anticipation effects, among others, are common to both fields and can benefit from cross-pollination. This article also demonstrates how trial emulation can identify settings that are not well-served by either approach, thereby avoiding studies unlikely to generate high-quality causal evidence. Finally, it informs the bias-variance-generalizability trade-off that arises with design and analysis choices made in these settings, supporting better evidence generation and interpretation in settings where important questions can be answered.
Surrogate markers offer the potential to reduce the burden of data collection by replacing costly or invasive primary outcomes with more accessible measurements, provided that they can faithfully indicate the effectiveness of a treatment. However, appropriate evaluation of a surrogate is particularly complex in longitudinal studies, where both outcomes and surrogates can evolve dynamically over time and interest lies not only in the treatment effect at one time, but rather treatment effects that may vary along the entire trajectory. In this paper, we develop a statistical framework for surrogate evaluation when both the surrogate and primary outcome are measured over time. Specifically, within the potential outcomes framework, we propose a formal causal definition of the proportion of the treatment effect on the longitudinal primary outcome that is explained by the treatment effect on the longitudinal surrogate. For estimation, we leverage state-space models, together with the Kalman filter and smoother, enabling efficient estimation of treatment effects under realistic conditions of temporal evolution and patient-level variability. We introduce a nonparametric bootstrap strategy for state-space models, a temporal homogeneity test, and demonstrate the finite-sample performance of our proposed methods via a simulation study and application to a diabetes clinical trial.
Multivariate mixed-type outcomes are difficult to model jointly, and additional complexity arises when both marginal effects and dependence structures vary with a covariate such as age or time. Existing approaches often impose restrictive dependence assumptions or lack sufficient flexibility to accommodate heterogeneous response types in a unified framework. To address this issue, we propose a Bayesian nonparametric framework for multivariate conditional copula regression with varying coefficients. The proposed model combines adaptive spline-based marginal regressions with an infinite mixture of Gaussian copulas whose weights vary with the covariate through a probit stick-breaking process. This construction provides flexible covariate-dependent dependence modeling while avoiding explicit global constraints on functional correlation matrices. We further establish approximation results for the proposed copula representation and develop a Markov chain Monte Carlo algorithm for posterior inference. Simulation studies show accurate recovery under correct specification and robust performance under copula misspecification. In an analysis of the BRFSS 2023 data, the proposed model reveals age-varying marginal effects and dependence patterns among multiple health outcomes, providing a coherent joint view of multimorbidity beyond separate marginal analyses.
Bounding causal effects analytically, rather than numerically, is appealing for its interpretability and conceptual clarity. Existing sharp methods rely on optimization-based approaches such as the Balke-Pearl framework, whose computational complexity grows rapidly. An alternative line of work derives bounds heuristically using probability laws and generic inequalities, and some recent papers have claimed or conjectured that this approach can yield sharp analytical bounds with substantially lower complexity. In this paper, we show that this perceived advantage is illusory. In particular, in a discrete instrumental variable setting, we show that any sharp analytical bound for the average treatment effect must be expressible as a maximum (minimum) over a collection of linear terms whose cardinality grows exponentially in the number of values taken by the outcome. In parallel, we show that the number of instrumental variable inequalities itself also grows exponentially. Consequently, bounds and inequalities expressed using only polynomially many such terms cannot be sharp. As a constructive complement, the paper is accompanied by code implemented in Python and R to derive sharp analytical bounds and sharp inequalities with optimal efficiency, matching the lower bounds proven in this paper. The code is available online.
When variable selection methods are applied to bootstrapped and multiply imputed datasets, the set of selected variables typically varies across iterations. Aggregating results via the union rule can lead to overly dense models. We propose a sequential evidence aggregation procedure that models detection outcomes across perturbation iterations as Bernoulli trials and accumulates evidence for variable relevance through a likelihood-ratio process admitting an approximate Bayes-factor interpretation. The procedure provides both a variable inclusion criterion and a stopping rule that eliminates the need to fix the number of bootstrap-imputation iterations ex ante. A Monte Carlo study across 126 scenarios and an empirical illustration demonstrate the method's performance relative to existing aggregation approaches.
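The idea in the abstract above of accumulating evidence from Bernoulli detection outcomes through a likelihood-ratio process with a stopping rule can be sketched with a generic Wald sequential probability ratio test. This is a stand-in technique, not necessarily the authors' procedure: the function name `sequential_selection`, the selection rates `p0` and `p1`, and the error levels `alpha` and `beta` are hypothetical choices for illustration.

```python
import math

def sequential_selection(detections, p0=0.2, p1=0.8, alpha=0.05, beta=0.05):
    """Wald SPRT on 0/1 detection outcomes for one variable.

    H0: selection probability p0 (irrelevant), H1: selection probability p1 (relevant).
    All parameter values are assumed for illustration only.
    """
    upper = math.log((1 - beta) / alpha)     # cross above: include the variable
    lower = math.log(beta / (1 - alpha))     # cross below: exclude the variable
    llr, t = 0.0, 0
    for t, d in enumerate(detections, start=1):
        # Log-likelihood-ratio increment of one Bernoulli trial
        llr += math.log(p1 / p0) if d else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "include", t, llr
        if llr <= lower:
            return "exclude", t, llr
    return "undecided", t, llr

# Detection indicators over successive bootstrap-imputation iterations (toy data)
always = sequential_selection([1, 1, 1, 1, 1])
never = sequential_selection([0, 0, 0, 0, 0])
mixed = sequential_selection([1, 0, 1, 0])
```

With these assumed settings, a variable detected in every iteration is included after three iterations and one never detected is excluded after three, so the number of iterations need not be fixed in advance; an alternating pattern leaves the decision open.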
The linked micromaps approach was originally developed as an improvement to choropleth maps for displaying statistical summaries connected with spatial areal units, such as countries, states, and counties. Two R packages to create linked micromaps were published in 2015. These are the micromap and micromapST packages. The latter was originally for data indexed to the 50 US states and DC, but the latest version accommodates arbitrary geographies. The micromapST package handles the formatting needed for linked micromaps and offers several options for statistical displays (scatterplots, boxplots, time series plots, and more). The micromapST package is very useful and takes care of most details of the layouts, but it can be problematic specifying the data frames needed to create the desired graphic. Furthermore, exploring data through visualization is easier, faster, and more intuitive using a graphical user interface. This is the motivation behind the R Shiny micromapST app. This paper will serve as a brief tutorial and introduction to micromapST and the Shiny app using real-world data and applications. In this paper, we provide background information on visualizing geographically indexed data and linked micromaps in Section 1. Section 2 discusses the data sets used in two illustrative examples. Sections 3 and 4 describe the application interface and show how it can create linked micromaps. The paper concludes with comments and future work.
This paper studies Graphical SLOPE for precision matrix estimation, with emphasis on its ability to recover both sparsity and clusters of edges with equal or similar strength. In a fixed-dimensional regime, we establish that the root-$n$ scaled estimation error converges to the unique minimizer of a strictly convex optimization problem defined through the directional derivative of the SLOPE penalty. We also establish convergence of the induced SLOPE pattern, thereby obtaining an asymptotic characterization of the clustering structure selected by the estimator. A comparison with GLASSO shows that the grouping property of SLOPE can substantially improve estimation accuracy when the precision matrix exhibits structured edge patterns. To assess the effect of departures from Gaussianity, we then analyze Gaussian-loss precision matrix estimation under elliptical distributions. In this setting, we derive the limiting distribution and quantify the inflation in variability induced by heavy tails relative to the Gaussian benchmark. We also study TSLOPE, based on the multivariate $t$-loss, and derive its limiting distribution. The results show that TSLOPE offers clear advantages over GSLOPE under heavy-tailed data-generating mechanisms. Simulation evidence suggests that these qualitative conclusions persist in high-dimensional settings, and an empirical application shows that SLOPE-based estimators, especially TSLOPE, can uncover economically meaningful clustered dependence structures.
This study investigates the relationship between longitudinal serum creatinine measurements and the risk of adverse kidney outcomes in paediatric patients with auto-immune disorders at Great Ormond Street Hospital for Children NHS Foundation Trust, London. To jointly analyse repeated biomarker measurements and time-to-event outcomes, we employed a joint modelling framework that combines the creatinine trajectories with the time to death or diagnosis of acute kidney injury or chronic kidney disease. Covariates considered in analysis included demographic and clinical characteristics. The results demonstrate a strong association between evolving creatinine profiles and the risk of the composite event. Specifically, treatment with corticosteroids and calcium channel blockers was associated with an increased event risk, whereas immunosuppressive therapy was associated with a reduced risk. The longitudinal component showed that creatinine trajectories were significantly influenced by age and BMI z-score. To demonstrate the practical utility of the proposed framework, dynamic risk predictions were generated using patients' observed creatinine trajectories. Model performance was compared using model selection criteria, alongside area under the curve and Brier score to evaluate the accuracy of dynamic risk predictions. These predictions illustrate the potential of joint models to support personalised medicine and clinical decision making in paediatric nephrology through real-time risk assessment.
Classical Fisher-information asymptotics describe the covariance of regular efficient estimators through the local quadratic approximation of the log-likelihood, and thus capture first-order geometry only. In curved models, including mixtures, curved exponential families, latent-variable models, and manifold-constrained parameter spaces, finite-sample behavior can deviate systematically from these predictions. We develop a coordinate-invariant, curvature-aware refinement by viewing a regular parametric family as a Riemannian manifold \((Θ,g)\) with Fisher--Rao metric, immersed in \(L^2(μ)\) through the square-root density map. Under suitable regularity and moment assumptions, we derive an \(n^{-2}\) correction to the leading \(n^{-1}I(θ)^{-1}\) covariance term for score-root, first-order efficient estimators. The correction is governed by a tensor \(P_{ij}\) that decomposes canonically into three parts, an intrinsic Ricci-type contraction of the Fisher--Rao curvature tensor, an extrinsic Gram-type contraction of the second fundamental form, and a Hellinger discrepancy tensor encoding higher-order probabilistic information not determined by immersion geometry alone. The extrinsic term is positive semidefinite, the full correction is invariant under smooth reparameterization, and it vanishes identically for full exponential families. We then extend the picture to singular models, where Fisher information degenerates. Using resolution of singularities under an additive normal crossing assumption, we describe the resolved metric, the role of the real log canonical threshold in learning rates and posterior mean-squared error, and a curvature-based covariance expansion on the resolved space that recovers the regular theory as a special case. This framework also suggests geometric diagnostics of weak identifiability and curvature-aware principles for regularization and optimization.
Sparse penalized quantile regression provides an effective framework for variable selection and robust estimation in high-dimensional data analysis. When explanatory variables are organized into groups, achieving sparsity both within and between groups is essential. However, existing quantile regression methods often fail to meet this dual objective. To address this gap, we introduce the adaptive sparse group lasso penalized quantile regression, which integrates adaptive lasso and adaptive group lasso penalties. We optimize the model parameters via the alternating direction method of multipliers (ADMM) applied to the dual problem, and establish global convergence. Through extensive simulation studies and real data analyses, we demonstrate (i) the efficacy of the proposed method in achieving simultaneous within- and between-group sparsity, and (ii) the computational efficiency of our algorithm relative to existing alternatives.
There is a growing recognition of the importance of involving patients in every stage of drug development. This shift acknowledges that patients' perspectives, experiences, and preferences are essential for ensuring that treatments meet real-world needs. In this context, a new body of statistical literature has emerged, focusing not only on the simultaneous consideration of multiple outcomes that reflect patients' overall experiences, but also on their structured prioritization. We refer to this class of approaches as hierarchical multi-component statistical methods. Among these, two influential frameworks - generalized pairwise comparisons (GPC) and desirability of outcome ranking (DOOR) - have emerged in the last decade, each aiming to offer a comprehensive approach to evaluating treatment effects. A new methodology, referred to here as the Markov ordinal state transition model (MOST), has recently been introduced without an explicit link to either GPC or DOOR. This paper seeks to fill this gap by offering a comprehensive and comparative analysis of the three approaches. Through examples and an exploration of the structural and philosophical differences between the methods, our aim is to provide guidance and encourage lines of research in the rapidly evolving landscape of hierarchical multi-component statistical methodologies.
This paper proposes a dynamic network framework for uncovering latent community paths in high-dimensional VAR-type models. By embedding a degree-corrected stochastic co-blockmodel (ScBM) into the transition matrices of VAR-type systems, we separate sending and receiving roles at the node level and summarize complex directional dependence in an interpretable low-dimensional form. Our method integrates directed spectral co-clustering with eigenvector smoothing to track how directional groups split, merge, or persist over time. This framework accommodates both periodic VAR (PVAR) models for cyclical seasonal evolution and generalized VHAR models for structural transitions across ordered dependence horizons. We establish non-asymptotic misclassification bounds for both procedures and provide supporting evidence through Monte Carlo experiments. Applications to U.S. nonfarm payrolls distinguish a recurrent business-centered core from more mobile, seasonally sensitive sectors. In global stock volatilities, the results reveal a compact U.S.-centered long-horizon block, a Europe-heavy developed core, and a more dynamic short-horizon reallocation of peripheral and bridge markets.
The menstrual cycle influences numerous physiological and psychological outcomes, yet standardised, open-source statistical methods for quantifying these cyclic effects remain lacking. We developed mcanalysis, an open-source package in R and Python implementing a Fourier-basis generalised additive model (GAM) for menstrual cycle research. The package provides a complete pipeline: processing period dates, labelling cycle days relative to menstruation onset, filtering physiologically plausible cycles, normalising outcomes to individual means, fitting cyclic GAMs with bootstrap confidence intervals, and identifying turning points to generate phase-specific linear trend estimates. We demonstrate the package on 15 wearable and self-reported outcomes using data from the Juli chronic health management application (N = 2,816 users). Nine of 15 outcomes showed evidence of association with the menstrual cycle (p < 0.05), spanning physiological (HRV p < 0.001, oxygen saturation p = 0.002), sleep (p = 0.003), symptom (migraine p < 0.001, headache p = 0.005), mood (EMA mood p = 0.024, PHQ-8 lack of energy p = 0.008, mania p = 0.041), and activity (hours outside p = 0.019) domains. No tested confounders were significantly associated with cycle-normalised outcomes. mcanalysis provides a standardised, reproducible approach to menstrual cycle analysis for users at all levels of statistical expertise. The package is freely available at https://github.com/kyradelray/mcanalysis, with a no-code web interface at https://kyradelray.shinyapps.io/mcanalysis/.
Expected Shortfall (ES) is a coherent measure of tail risk that captures the average loss beyond a quantile threshold. Despite the growing literature on ES regression conditional on covariates, no existing work considers ES modeling in panel data settings where both cross-sectional and temporal dependencies are present. This paper introduces the panel ES regression model with a latent factor structure to capture cross-sectional dependence. We develop a two-stage estimation procedure robust to heavy-tailed errors, recovering the conditional quantile in the first stage and iteratively estimating the ES factor model in the second stage. Theoretically, we establish the consistency and asymptotic normality of the proposed two-step ES estimators and derive non-asymptotic error bounds for both the panel quantile and ES estimators. We also provide a non-asymptotic normal approximation for the standardized ES regression estimator, bridging asymptotic theory and finite-sample practice. Simulation evidence shows that the proposed method delivers substantial gains in both parameter estimation and factor recovery, particularly in the presence of latent tail dependence. An empirical application further indicates that the extracted ES factors carry distinct pricing information that is not captured by conventional mean or quantile-based approaches.
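To make the definition in the abstract above concrete (ES as the average loss beyond a quantile threshold), here is a minimal sketch of the unconditional empirical version. It does not implement the panel regression or the two-stage factor estimator; the function name `expected_shortfall`, the level `alpha=0.95`, and the toy losses are assumptions for illustration.

```python
import numpy as np

def expected_shortfall(losses, alpha=0.95):
    """Empirical quantile threshold and ES (mean of losses at or above it).

    Hypothetical helper illustrating the definition only.
    """
    losses = np.asarray(losses, dtype=float)
    threshold = np.quantile(losses, alpha)   # empirical alpha-quantile of the losses
    tail = losses[losses >= threshold]       # losses beyond the threshold
    return threshold, tail.mean()

# Toy losses 1, 2, ..., 100 (assumed data)
losses = np.arange(1, 101, dtype=float)
var_level, es = expected_shortfall(losses, alpha=0.95)
```

By construction the ES is never smaller than the quantile threshold, which is why it is read as a summary of the tail rather than of a single point in it.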
This work presents a tractable approach to multi-object posterior computation under a generic measurement likelihood function. While filtering is a popular solution, valuable historical information is discarded. Posterior inference, which captures the full history of the multi-object states, provides a more comprehensive solution but is notoriously difficult and has received limited attention. Our proposed approach uses Gibbs Sampling (GS) to generate samples from the multi-object posterior. In particular, we establish that the conditional distributions of the multi-object posterior are Bernoulli random finite sets with explicit existence probabilities and attribute densities. These conditionals are straightforward to evaluate and sample from, enabling the construction of an efficient Gibbs sampler with standard convergence guarantees. To demonstrate its versatility, we develop the first multi-scan multi-object smoothing algorithm for superpositional measurements. Numerical experiments show that the proposed method delivers robust performance in challenging low-SNR scenarios where detection based smoothing deteriorates. Moreover, posterior samples obtained from our approach provide statistical characterizations of key variables and parameters, highlighting the advantages of posterior inference. This approach enriches multi-object estimation techniques, which historically lacked smoothing capabilities for non-standard measurements.
In-context learning enables transformers to adapt to new tasks from a few examples at inference time, while grokking highlights that this generalization can emerge abruptly only after prolonged training. We study task generalization and grokking in in-context learning using a Bayesian perspective, asking what enables the delayed transition from memorization to generalization. Concretely, we consider modular arithmetic tasks in which a transformer must infer a latent linear function solely from in-context examples and analyze how predictive uncertainty evolves during training. We combine approximate Bayesian techniques to estimate the posterior distribution and we study how uncertainty behaves across training and under changes in task diversity, context length, and context noise. We find that epistemic uncertainty collapses sharply when the model groks, making uncertainty a practical label-free diagnostic of generalization in transformers. Additionally, we provide theoretical support with a simplified Bayesian linear model, showing that asymptotically both delayed generalization and uncertainty peaks arise from the same underlying spectral mechanism, which links grokking time to uncertainty dynamics.
Extreme value theory offers a statistical framework for quantifying the risk of rare events, with the generalized Pareto (GP) distribution providing the canonical limit model for univariate threshold exceedances. In many applications, however, extremes are intrinsically multivariate, requiring models that capture both marginal tail behaviours and joint extremal dependencies. Under asymptotic dependence, the multivariate GP distribution represents a suitable modelling family, but when asymptotic independence arises, sub-asymptotic models are needed. In this work, we propose and study a flexible sub-asymptotic parametric class to model bivariate threshold exceedances. Our new model accommodates a broad range of tail dependence behaviours and contains the standardised multivariate GP distribution as a limiting case while retaining margins that converge to univariate GP tails. Our formulation allows extremal dependence to evolve naturally with the marginal parameters on the original data scale, facilitating direct computation and interpretation of failure probabilities. Model inference is done via a likelihood-free neural Bayes estimation approach, with tailored prior specifications. An extensive simulation study and an application to Belgian rainfall extremes illustrate the estimation framework and the flexibility of the model.
We decompose the Kullback--Leibler generalization error (GE) -- the expected KL divergence from the data distribution to the trained model -- of unsupervised learning into three non-negative components: model error, data bias, and variance. The decomposition is exact for any e-flat model class and follows from two identities of information geometry: the generalized Pythagorean theorem and a dual e-mixture variance identity. As an analytically tractable demonstration, we apply the framework to $ε$-PCA, a regularized principal component analysis in which the empirical covariance is truncated at rank $N_K$ and discarded directions are pinned at a fixed noise floor $ε$. Although rank-constrained $ε$-PCA is not itself e-flat, it admits a technical reformulation with the same total GE on isotropic Gaussian data, under which each component of the decomposition takes closed form. The optimal rank emerges as the cutoff $λ_{\mathrm{cut}}^{*} = ε$ -- the model retains exactly those empirical eigenvalues exceeding the noise floor -- with the cutoff reflecting a marginal-rate balance between model-error gain and data-bias cost. A boundary comparison further yields a three-regime phase diagram -- retain-all, interior, and collapse -- separated by the lower Marchenko--Pastur edge and an analytically computable collapse threshold $ε_{*}(α)$, where $α$ is the dimension-to-sample-size ratio. All claims are verified numerically.
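The $ε$-PCA construction described in the abstract above (retain the leading empirical eigenvalues, pin the discarded directions at the noise floor $ε$) can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: the function name `eps_pca_covariance`, the mean-centering step, the $1/n$ covariance normalization, and the toy data are assumptions. When no rank is given, the sketch uses the cutoff rule stated in the abstract and keeps exactly the eigenvalues exceeding `eps`.

```python
import numpy as np

def eps_pca_covariance(X, eps, rank=None):
    """Model covariance keeping the top `rank` empirical eigenvalues and
    setting all remaining eigenvalues to the noise floor `eps`.

    Hypothetical helper; centering and 1/n scaling are assumed conventions.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / X.shape[0]                 # empirical covariance
    evals, evecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]            # reorder to descending
    evals, evecs = evals[order], evecs[:, order]
    if rank is None:
        rank = int(np.sum(evals > eps))        # cutoff rule: keep eigenvalues > eps
    model_evals = np.full_like(evals, eps)     # discarded directions pinned at eps
    model_evals[:rank] = evals[:rank]
    return (evecs * model_evals) @ evecs.T, rank

# Toy data (assumed): two strong directions, three weak ones
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 0.5, 0.3, 0.1])
C, k = eps_pca_covariance(X, eps=1.0)
```

In this toy example only the two high-variance directions exceed the floor, so `k` is 2 and the remaining three eigenvalues of `C` equal `eps`.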
Methods for quantifying the similarity of datasets are relevant in applications where two or more datasets, or their underlying distributions, need to be compared, ranging from two- and k-sample testing to applications in machine learning and synthetic data generation. Many methods for quantifying the similarity of datasets are available from the literature, but due to the lack of neutral comparison studies, it is unclear which method to choose when. Here, 36 methods applicable to continuous data are compared across various scenarios, including two or more datasets drawn from different distributions. Several deviations between datasets are considered, including shift and scale alternatives or differences in higher moments. An overall method ranking is established based on the methods' abilities to differentiate between datasets from different distributions, combined with computational aspects. Based on this, concrete decision rules for finding the best method based on characteristics of the datasets are determined. Moreover, combinations of four to six methods are proposed in the two-sample case such that in 90% to 95% of the considered scenarios, at least one of these methods is almost as good as the best method. In the multi-sample case, a combination of two to three methods is proposed analogously.
Reliable analysis of migration is critically dependent on the quality and consistency of the underlying data. Indian migration data, primarily derived from decennial census records, are affected by systematic gaps arising from uneven coverage and measurement inconsistencies across states and time. This paper presents a data-centric framework, HICM, for harmonizing Indian census migration data and correcting prominent sources of bias prior to downstream analyses. We explicitly identify two types of bias across three decades of migration data: measurement bias and representativeness bias. We propose to address these gaps through principled pre-processing, mitigation, and validation strategies grounded in statistical diagnostics. An empirical evaluation of harmonized Indian interstate migration data reveals that bias-aware data correction substantially improves the structural consistency of the data and enhances the reliability of subsequent temporal analyses. By improving data quality through reproducible data imputation and smoothing, this work advances migration analytics and provides a robust foundation for policy-relevant longitudinal network analysis of Indian internal migration.