Methodology

https://papers.cool/arxiv/stat.ME Methodology 2026-07-22T00:00:00+00:00

https://papers.cool/arxiv/2607.19311 GARTFIMA Models: A Class of Observation-Driven Models with Tempered Fractional Dynamics 2026-07-21T17:26:34+00:00 Guilherme Pumi Sharandeep Singh Pandher Taiane Schaedler Prass

This paper introduces a class of observation-driven models whose systematic component includes a tempered fractional differencing term. This specification generalizes long-range dependent models based on the fractional differencing operator, enabling a more general and robust model specification while offering theoretical advantages. We propose a partial maximum likelihood approach for parameter estimation and address hypothesis testing, confidence intervals, goodness-of-fit assessment, and both in-sample and out-of-sample forecasting. A Monte Carlo simulation study evaluates the finite-sample performance of the proposed estimation method, and an empirical application illustrates the model's practical utility.
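
The abstract does not spell out the tempered fractional differencing term. As a rough illustration only, the sketch below assumes the form common in the tempered fractional literature, $(1 - e^{-\lambda}B)^d$ with backshift operator $B$, and computes the first coefficients of its power-series expansion. The function name and the parameters `d`, `lam`, and `n_terms` are hypothetical and are not taken from the paper.

```python
import math


def tempered_frac_diff_weights(d, lam, n_terms):
    """Coefficients pi_k in (1 - exp(-lam) * B)^d = sum_k pi_k * B^k.

    Assumed operator form, not confirmed by the abstract.
    """
    weights = [1.0]
    decay = math.exp(-lam)
    for k in range(1, n_terms):
        # Binomial-series recursion for (1 - z)^d with z = exp(-lam) * B:
        # pi_k = pi_{k-1} * (k - 1 - d) / k * exp(-lam).
        weights.append(weights[-1] * (k - 1 - d) / k * decay)
    return weights


# lam = 0 gives the untempered fractional differencing weights;
# lam > 0 adds geometric damping at larger lags.
plain = tempered_frac_diff_weights(0.4, 0.0, 50)
tempered = tempered_frac_diff_weights(0.4, 0.1, 50)
```

Setting `lam = 0` recovers ordinary fractional differencing, which is the sense in which a specification of this kind generalizes long-range dependent models.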

https://papers.cool/arxiv/2607.19280 Rank-Based Estimation of U-Shaped Biomarker Risk Curves and Critical Points for Time-to-Event Outcomes 2026-07-21T16:51:01+00:00 Zhirui Fu Mei-Cheng Wang Yu Du Yuxin Zhu

U-shaped relationships between prognostic biomarker levels and adverse event risk are commonly observed across diseases, where both low and high biomarker values are associated with elevated risk, with a well-defined minimum -- the critical point -- marking the biomarker value of the lowest risk. The U-shaped risk curve, especially the location of the critical point, informs the identification of high- and low-risk subgroups. However, existing methods are limited: U-shaped risk models rarely accommodate survival outcomes, and existing survival analysis methods do not enable estimation of or formal inference for the critical point. To fill this gap, we propose a semiparametric transformation model that explicitly parameterizes the critical point, a rank-based maximum C-index estimator for the parametric component, and a smoothed Kaplan-Meier estimation approach for the nonparametric component. The resulting framework estimates both subgroup-specific U-shaped risk curves and their critical points within a single survival model. We establish consistency and asymptotic normality of the proposed estimators and demonstrate their finite-sample performance through numerical studies. We apply the proposed methods to UK Biobank data to characterize subgroup-specific U-shaped associations between body mass index and all-cause mortality and to identify the corresponding critical points.

https://papers.cool/arxiv/2607.19079 Morillas-type transformations of copulas and stable tail dependence functions 2026-07-21T13:10:49+00:00 Klaus Herrmann Marius Hofert Mélina Mailhot Nahid Sadr

A stochastic representation and sampling algorithm for Morillas-type copula-to-copula transformations and related distortions of multivariate distribution functions is derived, yielding as a byproduct a novel sampling scheme for Archimedean and Archimax copulas. This closes a methodological gap and facilitates simulation-based applications of distorted copulas. For stable tail dependence functions (stdfs), a Morillas-type distortion framework is introduced, where monomial distortions with exponents below 1 are shown to preserve stdfs via a domain-restricted Pexider equation analysis. This characterization is leveraged to identify distortions preserving extreme value copulas (EVCs), and convex combinations of distorted stdfs are proposed to increase flexibility in extremal dependence modeling. The impact of distortions on maximum domain of attraction limits is also analyzed. Explicit limiting EVC distortions are identified under non-restrictive regular variation assumptions. Examples are given of absolutely monotone distortions that allow fine-tuning of the extreme value behavior after distortion, as well as of non-regularly varying 2-absolutely monotone distortions.

https://papers.cool/arxiv/2607.19050 A Bayesian Approach to Causal Cure Models 2026-07-21T12:38:12+00:00 Emma Torrini Maïlis Amico Nicolas Molinari Clément Berenfeld

Time-to-event data often include individuals who will never experience the failure event, and are therefore considered cured. In such settings, frequently encountered in clinical research, a substantial proportion of patients may remain event-free throughout the observation period, leading to the appearance of a survival plateau that is commonly interpreted as evidence of a cured fraction. Analysing such data requires inference on the cure fraction and on the survival function for the uncured subpopulation, tasks which are traditionally achieved with mixture cure models. However, assessing the causal effect of a treatment on these quantities is non-trivial. We consider principal stratification causal estimands, which have been proposed to evaluate effects on the cure fraction and on the survival for an always-uncured stratum. We additionally introduce a novel estimand, which considers the causal effect on the survival for a non-always-cured union of strata. We frame the problem from a Bayesian model-based perspective, which provides a flexible and unified estimation strategy while maintaining a direct link with classical mixture cure model quantities. The reliability of the proposed approach is validated through simulations, demonstrating competitive and robust performance relative to existing methods. Finally, we illustrate its practical usefulness through an application to a randomized trial comparing non-invasive ventilation with standard oxygen therapy in patients with hypoxemic respiratory failure following abdominal surgery.
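
The survival plateau mentioned above follows directly from the classical mixture cure decomposition $S(t) = \pi + (1-\pi)S_u(t)$, where $\pi$ is the cure fraction and $S_u$ is the survival function of the uncured. The sketch below is a minimal illustration of that decomposition, not the authors' Bayesian method; the exponential form for $S_u$ and all names are assumptions made for the example.

```python
import math


def population_survival(t, cure_fraction, hazard_uncured):
    """Mixture cure survival S(t) = pi + (1 - pi) * S_u(t).

    S_u is taken to be exponential purely for illustration (an assumption).
    """
    s_uncured = math.exp(-hazard_uncured * t)
    return cure_fraction + (1.0 - cure_fraction) * s_uncured


# The curve starts at 1 and flattens toward the cure fraction as t grows.
curve = [population_survival(t, cure_fraction=0.3, hazard_uncured=0.5)
         for t in range(0, 41, 5)]
```

Because $S_u(t)$ tends to zero, the population curve levels off at $\pi$, which is why a plateau is commonly read as evidence of a cured fraction.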

https://papers.cool/arxiv/2607.19043 Local Second-Order Geometry Induced by Deformation Maps 2026-07-21T12:34:04+00:00 Maria Laura Battagliola

Spatial deformations offer a flexible route to nonstationary dependence by warping the coordinates of a stationary random field. While the exact induced covariance depends on the deformation map in its entirety, we show that its behavior in a neighborhood is approximated accurately by linearization. This produces a tangent covariance whose discrepancy from the true covariance we bound explicitly, and its Fourier transform yields a local spectrum in closed form. Building on this spectral description, we introduce a simulation scheme that generates a deformed Gaussian field in a neighborhood accounting for the local spectrum, so that the simulated field reproduces the finite dimensional tangent covariance by construction. For repeated sampling across many reference points, a truncated singular value decomposition compresses the space and frequency weights into a reusable form. We further apply the summaries based on the local Jacobian as an exploratory device for deformations estimated from images, using cardiac magnetic resonance data from the Automated Cardiac Diagnosis Challenge together with optical flow. The resulting local geometry exhibits differences across diagnostic groups through directional and anisotropic features of myocardial deformation that go beyond simple measures of local expansion or compression.

https://papers.cool/arxiv/2607.18926 Informative Distance-Based Priors for Correlation Matrices Centred on a Target Reference 2026-07-21T10:09:53+00:00 Anna Freni-Sterrantino Janet van Niekerk Elias Teixeira Krainski Denis Rustand Haziq Jamil Håvard Rue

Specifying a prior over the space of correlation matrices is a persistent challenge in Bayesian analysis. The space is a curved manifold whose dimension grows quadratically with the number of variables, making substantive prior beliefs difficult to encode. We propose a distance-based prior that assigns mass decaying exponentially in the Fisher arc-length distance from a user-specified reference correlation matrix, enabling shrinkage toward any target correlation structure rather than being confined to the identity matrix. Formally, this is constructed as a Penalised Complexity prior, but its interpretation shifts accordingly: unless the chosen target represents a structurally simpler state, the shrinkage penalises deviation rather than complexity in the usual sense. To accommodate conditional independence constraints, we introduce a parameterisation that constructs the correlation matrix via the Cholesky factor of the inverse correlation matrix with respect to a user-supplied graph, thereby reducing the number of free parameters from one per variable pair to one per graph edge. The prior is proper for every positive value of its rate parameter, accommodates correlations of either sign under any graph structure, and reduces to a fully unstructured prior when the graph is complete. A direct sampling algorithm is provided, enabling prior predictive checks and sensitivity analysis, implemented within the \texttt{graphpcor} package.

https://papers.cool/arxiv/2607.18836 Partial pooling predicts cross-validation reliability: a closed-form triage and Rao-Blackwellised cure for hierarchical LOO 2026-07-21T08:18:44+00:00 Aidan D Bindoff

For hierarchical models, Pareto-smoothed importance-sampling leave-one-out cross-validation (PSIS-LOO) fails on the folds where a random-effect coordinate is data-driven and its group is small. We show that the Gelman-Pardoe pooling factor and structural leverage predict these folds from model structure and group sizes, without forming importance weights. In Gaussian linear mixed models the leverage reduces to group size, giving a design-time map that separates the failing ($\hat{k}>0.7$) folds with AUC 0.96; across replicated logistic GLMMs the post-fit, weight-free predictor reaches AUC 0.81. The cure is integrated importance sampling: marginalise the random-effect block and importance-sample only the base parameters. This is not new, but we contribute its observation-level specialisation for random-intercept GLMMs: an analytic Gaussian downdate and a 1-D quadrature for Bernoulli, binomial and Poisson responses, packaged as a drop-in rb_loo(fit). Against exact refits, this marginalised estimator (RB-LOO) is $3\times$ more accurate than moment matching on singleton-heavy logistic GLMMs. On overdispersed count data with 97 failing folds, moment matching leaves 37 uncorrected and is no more accurate than raw PSIS-LOO, while RB-LOO reproduces the 82-minute exact refit (elpd RMSE 0.04) at no cost. The error changes decisions: against a negative-binomial model, PSIS-LOO reports decisive evidence ($z=4.9$) and reloo reports significant evidence ($z=3.4$) for the more complex model, where an exact analysis, reproduced by RB-LOO, finds the two indistinguishable ($z=1.0$). A base-fiber Schur decomposition splits case-deletion influence into a vertical (pooling) term that governs where PSIS-LOO fails and a horizontal (variance-component) term that governs where RB-LOO is itself strained, giving a two-level triage that recovers the exact answer while refitting only the few folds that need it.

https://papers.cool/arxiv/2607.18741 Variational Bayesian Sparse Negative Binomial Regression 2026-07-21T06:04:56+00:00 Mitra Kharabati Morteza Amini Mohammad Arashi

Count data with overdispersion and high-dimensional predictors pose significant challenges in modern applications. While negative binomial regression offers a flexible modeling framework, existing Bayesian approaches rely on computationally expensive MCMC methods that become impractical in high-dimensional settings. This paper develops a variational Bayesian framework for sparse negative binomial regression using horseshoe and continuous shrinkage priors. Our proposed methods achieve estimation accuracy and variable selection performance comparable to MCMC benchmarks while requiring less than 1% of the computation time. Extensive simulations demonstrate that the negative binomial specification is essential for overdispersed data, as Poisson-based approaches exhibit substantial performance degradation under overdispersion. Conversely, our methods remain robust when the data are Poisson, making them a safer default choice. Applications to real benchmark datasets further confirm the practical utility of our approach. The proposed framework provides a computationally efficient and reliable tool for sparse count regression in high-dimensional settings.

https://papers.cool/arxiv/2607.18545 Flexible Inference for Winners with Conditional Validity 2026-07-20T22:23:27+00:00 Soham Bakshi Lingjun Gao Zijun Gao Snigdha Panigrahi

Researchers often use a data-driven criterion to select top-performing options, or winners, such as treatments, models, or model features, and then report effect estimates for the selected winners. Naive post-selection estimates, however, are known to suffer from the winner's curse, producing systematically overoptimistic results. We introduce a flexible conditional inference method that corrects for this overoptimism through an adaptive exponential randomization scheme. Our method achieves selection quality that closely matches that of standard top-k selection, while also yielding shorter confidence intervals than existing approaches. Furthermore, our approach applies broadly to nonparametric settings with asymptotically linear selection statistics, covering wide-ranging applications such as inference for the efficacy of the most promising treatments in clinical trials, the abilities of top-ranked models on leaderboards, and the importance of the most predictive features in a model.
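
The winner's curse itself is easy to reproduce in a toy simulation. The sketch below illustrates only the problem, not the proposed randomization scheme: every option is assumed to have a true effect of zero and unit-variance Gaussian noise, and the function name and defaults are hypothetical.

```python
import numpy as np


def naive_winner_bias(n_options=10, n_reps=20_000, seed=0):
    """Average naive estimate for the selected winner when all true effects are 0."""
    rng = np.random.default_rng(seed)
    # Each row is one replication: noisy estimates of n_options null effects.
    estimates = rng.standard_normal((n_reps, n_options))
    # Top-1 selection: report the largest estimate in each replication.
    winners = estimates.max(axis=1)
    return float(winners.mean())


bias = naive_winner_bias()
```

Since every true effect is zero here, the returned average is the selection bias of the naive estimate, and it grows as `n_options` increases.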

https://papers.cool/arxiv/2607.18503 Adaptive Penalization and Bootstrap-Smoothed Inference for Two-Sample Mendelian Randomization with Summary Data 2026-07-20T20:51:46+00:00 Muhammad Qasim Kai Wang Ishan S Bhatt

Two-sample Mendelian randomization (MR) uses genetic variants as instrumental variables to estimate causal effects from observational data using summary association statistics. However, horizontal pleiotropy can invalidate standard MR estimators and lead to biased causal inference. Pleiotropy-robust methods have been proposed to address this issue, including regularization-based approaches such as MR-Lasso. However, MR-Lasso may fail to identify invalid instruments consistently, and its post-selection inference can be unreliable. In this paper, we develop two lasso-type procedures for two-sample MR with summary-level data. The first, MR-ALasso, extends MR-Lasso by introducing adaptive penalty weights for pleiotropic effects in order to improve the identification of valid and invalid instruments. The second, MR-ALasso-B, combines adaptive lasso selection with bootstrap smoothing to improve post-selection inference. We establish theoretical results for MR-ALasso under the two-sample summary data framework, including invalid instrument identification consistency and oracle-type post-selection behavior. Simulation studies show that MR-ALasso generally improves upon MR-Lasso in estimation accuracy and invalid-instrument identification, whereas MR-ALasso-B substantially improves coverage and type-I error control relative to naive post-selection inference. A real-data application based on bidirectional analyses of multiple complex traits further illustrates the practical usefulness of the proposed methods. We provide an R package, MRAlasso, to facilitate implementation.

https://papers.cool/arxiv/2607.18501 SPYCE: A Doubly Robust Estimator for Trials Targeting Early Huntington Disease under Outcome-Dependent Censoring 2026-07-20T20:48:26+00:00 Kihyun Han Yanyuan Ma Karen Marder Tanya P. Garcia

Clinical trials for neurodegenerative diseases must identify sensitive endpoints -- outcomes that change rapidly enough to detect treatment effects. In Huntington disease, this requires measuring how outcomes change as participants approach Stage 1. Yet many participants exit studies before reaching this stage, making their time to Stage 1 right-censored. Estimating how outcomes change requires models for both time to Stage 1 and time to study exit. When participants with worse outcomes exit earlier, this outcome-dependent censoring causes existing estimators to produce contradictory results: for the same cognitive outcome, one estimator suggests improvement while another shows decline. Existing estimators either ignore outcome-dependent censoring or require one model to be correctly specified, with no protection when it is not. We introduce SPYCE, a doubly robust estimator (consistent when either model is correctly specified) that achieves the smallest possible variance and allows both models to be estimated nonparametrically without sacrificing efficiency. Applied to data from PREDICT-HD, an observational Huntington disease study, SPYCE resolves current contradictions, identifies caudate and putamen volume ratios as the most promising sensitive endpoints, and shows that as few as 241 participants per arm are needed to detect treatment effects, versus hundreds of thousands under estimators that cannot handle outcome-dependent censoring.

https://papers.cool/arxiv/2607.18431 Using binary silver labels in electronic health records-based computable phenotyping algorithms 2026-07-20T18:32:41+00:00 Shuhe Wang Matthew T. Slaughter Jennifer C. Nelson Brian D. Williamson

Gold-standard phenotype labels are often unavailable at scale in electronic health record (EHR) studies because they require manual chart review. Weakly supervised phenotyping methods instead use silver-standard labels, such as diagnosis-code counts, natural language processing (NLP) mentions, medication indicators, or laboratory thresholds. PheNorm is widely used for this purpose, but its original formulation was designed for count-valued silver labels and relies on log transformation, utilization normalization, and Gaussian mixture modeling. These steps are not directly suited to binary silver labels, which are common and may be highly informative. We propose Binary PheNorm, an extension that uses binary silver labels directly in the corruption-and-regression denoising step and produces a continuous phenotype score without EM calibration. We also consider a lasso-regularized version for high-dimensional EHR settings and combined models using both binary and count labels. In simulations, Binary PheNorm achieved strong discrimination using binary labels alone and often improved performance when combined with count labels. In anaphylaxis, AUC increased from 0.793 for an epinephrine-mention indicator to 0.891-0.892 after Binary PheNorm. In acute pancreatitis, AUC increased from 0.736 for a lipase-threshold indicator to 0.805-0.819. These results support Binary PheNorm as a practical weakly supervised approach when informative binary silver labels are available.

https://papers.cool/arxiv/2607.19206 Some cautionary tales about Bayesian predictive inference 2026-07-21T15:38:15+00:00 Emanuela Dreassi Fabrizio Leisen Luca Pratelli Pietro Rigo

Two misunderstandings, frequently arising in Bayesian predictive inference, are discussed. The first deals with the data generating mechanism, while the second consists in overestimating the role played by asymptotic exchangeability. Some consequences of such misunderstandings are highlighted through examples.

https://papers.cool/arxiv/2607.19080 The Influence Function of Transport-based Quantiles 2026-07-21T13:14:18+00:00 Alberto González-Sanz Shunan Sheng Bohan Wu Marco Avella Medina

Transport-based quantiles extend univariate quantiles to multivariate distributions via optimal transport. We study the influence function of the transport quantile map $\mathbf{Q}_P$, defined as the optimal transport map pushing a fixed reference measure $\mu$ forward to a target distribution $P$. For the Huber contamination $P_t=(1-t)P+t\delta_{x_0}$, we prove that the first-order limit $\mathbf{I}(x_0;\mathbf{Q}_P(z)) := \lim_{t\downarrow 0} [\mathbf{Q}_{(1-t)P+t\delta_{x_0}}(z)-\mathbf{Q}_P(z)]/t$ exists whenever $x_0\ne \mathbf{Q}_P(z)$ and characterize it uniquely. Specifically, $\mathbf{I}(x_0;\mathbf{Q}_P(z))=\nabla G_{x_0}(z)$, where $G_{x_0}$ is characterized by a uniformly elliptic equation with a Dirac source and a Neumann boundary condition. In every dimension $d\ge 2$, this influence function has a pole-type singularity. For fixed $z\in\operatorname{int}(\Omega_\mu)$, it remains bounded when $\mathbf{F}_P(x_0)$ stays away from $z$, where $\mathbf{F}_P=\mathbf{Q}_P^{-1}$ is the transport-based distribution function, but diverges as $x_0\to\mathbf{Q}_P(z)$, equivalently as $\mathbf{F}_P(x_0)\to z$. In fact, $\|\mathbf{I}(x_0;\mathbf{Q}_P(z))\|\asymp\|z-\mathbf{F}_P(x_0)\|^{-(d-1)}$. This contrasts with the bounded influence function of univariate quantiles and implies that $\mathbf{I}(X;\mathbf{Q}_P(z))$, for $X\sim P$, has infinite second moment. Numerical experiments further suggest that empirical transport quantiles may exhibit stable-type non-Gaussian fluctuations.

https://papers.cool/arxiv/2607.18861 Learning sufficient low-dimensional structures through conditional optimal transport 2026-07-21T08:51:44+00:00 Kaiqiang Alan Zeng Efstathia Bura

Sufficient dimension reduction seeks a low-dimensional covariate representation that preserves the conditional law of a response. We introduce SDR-COT, which represents that law by conditional optimal transport from an independent reference response. On separable Hilbert spaces, sufficiency forces the response component of the optimal triangular map to factor through the reduction. For quadratic cost, the induced interpolation has a Borel current-state velocity on every truncated time interval, without global injectivity of the terminal map, and this velocity has the same factorisation. These results motivate a conditional-flow-matching criterion. For linear reductions, we prove consistency using a suitably tuned relaxed empirical coupling. Euclidean responses are treated through slicewise Caffarelli bounds; Hilbert-valued responses are treated through Gaussian Sobolev regularity, interpolation compression and uniqueness of a Gaussian continuity equation. Numerical studies with Euclidean and functional data show competitive performance, especially when sufficient information is not solely contained in the conditional mean.

https://papers.cool/arxiv/2607.18734 Uncertainty quantification in mechanics: A unified Bayesian perspective 2026-07-21T05:46:28+00:00 Sascha Ranftl Malte Rolf Gerhard A. Holzapfel Ellen Kuhl

Uncertainty quantification (UQ) is essential to experimental mechanics, but has become particularly relevant in computational mechanics, manifesting in two fundamental problem types: forward and inverse problems. The former addresses how input uncertainties propagate to the quantities of interest, whereas the latter aims to infer unknown parameters from experimental observations or simulations. Since efficient propagation typically requires a prohibitive number of evaluations to compute marginal output distributions, the development of fast, data-driven surrogate models becomes necessary. Thus, we can distinguish between two inverse tasks: (i) the identification and calibration of input uncertainties, and (ii) the construction of surrogates, a methodology collectively referred to as surrogate-based UQ. Building on probabilistic reasoning and the concept of partial belief, we demonstrate that Bayesian probability theory provides a unified theoretical framework for addressing both problem types. We further show that Bayesian inference allows for the seamless incorporation of essential subproblems, including model selection for identifying the most probable model specifications and experimental design for optimizing data collection by identifying experiments or simulations that maximize expected information gain about parameters, among others such as connections to sensitivity analysis or the use of special priors like random fields. While this theoretical framework is presented for general mechanical problems, particular emphasis is placed on biomechanics, where variability and uncertainty are especially pronounced due to inherent biological heterogeneity, patient-specific variability, and noisy data.

https://papers.cool/arxiv/2607.18661 Gaffke's confidence interval for the mean of bounded data is inadmissible but asymptotically efficient 2026-07-21T03:14:21+00:00 Jiahao Ming Aaditya Ramdas Yi Shen Ruodu Wang Ian Waudby-Smith

Given observations $\mathbf x=(x_1,\dots,x_n)$, Gaffke (2005) defined \[ K_n(\mathbf x)=\mathbb{P}_{\mathbf D}\!\left\{\sum_{i=1}^n x_iD_i\le 1\right\}, \qquad (D_0,D_1,\ldots,D_n)\sim\mathrm{Dirichlet}(1,\ldots,1), \] and conjectured that it is a $p$-value whenever the inputs are independent e-values. Recently, Vlassis and Thomas (2026) proved this conjecture. Inverting the tests for observations in $[0,1]$ gives the confidence interval studied by Learned-Miller and Thomas (2020), which reduces to Clopper--Pearson for Bernoulli data. We give a finite- and large-sample account of Gaffke's test and interval. First, for every $\mathbf x\in[0,\infty)^n$ and every elementary symmetric polynomial $e_k$, $ K_n(\mathbf x)e_k(\mathbf x)\le {n\choose k}, $ so the Gaffke $p$-value is never larger than the SymPol $p$-value of Ming et al. (2026). However, Gaffke's $p$-value is inadmissible. For $n=2$, we construct a valid rule that is strictly smaller on mixed configurations and is the unique admissible rule that dominates $K_2$. A neutral-face extension proves inadmissibility of $K_n$ for every $n\ge2$. If one independent uniform random variable is allowed, there is an even simpler full-dimensional improvement: on the upper orthant, where $K_n(\mathbf x)=1/\prod_i x_i$, replace it by $U/\prod_i x_i$. The equal-tail Gaffke confidence interval $I_n$ is nevertheless first-order asymptotically efficient: for iid observations on $[0,1]$ with unknown variance $\sigma^2>0$, \[ \sqrt n\,\operatorname{Width}(I_n)\longrightarrow 2\sigma z_{1-\alpha/2}\qquad\text{almost surely}. \] Our simulations also find that, among a variety of bounded-mean intervals considered, the Gaffke interval is the shortest, including comparisons with a recent empirical Berry--Esseen procedure having the same first-order Gaussian target.
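
The definition of $K_n$ above lends itself to a direct Monte Carlo approximation: draw the Dirichlet weights, form the weighted sum, and count how often it stays at or below 1. The sketch below is only an illustrative approximation of that definition, not the authors' implementation; the function name, the number of draws, and the seed are assumptions.

```python
import numpy as np


def gaffke_k(x, n_draws=200_000, seed=0):
    """Monte Carlo estimate of K_n(x) = P{ sum_i x_i * D_i <= 1 }."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # (D_0, D_1, ..., D_n) ~ Dirichlet(1, ..., 1); D_0 is the slack coordinate
    # that does not enter the weighted sum.
    d = rng.dirichlet(np.ones(x.size + 1), size=n_draws)
    weighted_sums = d[:, 1:] @ x
    return float(np.mean(weighted_sums <= 1.0))


# With every x_i >= 1 the estimate should be close to 1 / prod(x_i),
# here 1 / (2 * 4) = 0.125, up to Monte Carlo error.
estimate = gaffke_k([2.0, 4.0])
```

When all inputs lie in $[0,1]$ the weighted sum can never exceed 1, so the estimate is exactly 1. This matches the intuition that small e-values carry no evidence against the null.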

https://papers.cool/arxiv/2607.18364 Evaluating the Impact of Epidemic Control via State-Dependent Markovian Switching Modeling 2026-07-20T16:08:29+00:00 Vasileios E. Papageorgiou Irene Votsi Samis Trevezas

We develop an exact finite-population stochastic framework for SIR epidemics evolving under Markovian switching between intervention regimes. The epidemic state is augmented by a finite phase component, allowing transmission, recovery, and direct immunity-acquisition rates to depend on the active regime. Phase-transition intensities may depend on the current epidemic state, so that policy escalation can react to the number of infectious individuals. Exploiting the monotonicity of the susceptible compartment, we derive level-wise recursions for the joint Laplace--Stieltjes transform and probability generating function of the extinction time and the number of infections generated before extinction. These recursions yield the infection-count distribution, conditional extinction-time transforms, and mixed moments linking epidemic duration and infection burden, while replacing a large global linear system with small phase-level solves. The framework is illustrated using weekly mpox incidence data from Luxembourg. A baseline one-phase SIR model is calibrated by maximum likelihood under a Poisson observation model. The calibrated baseline is then used for conditional comparisons of fixed control regimes, early versus delayed strict intervention, vaccination-supported control, and state-dependent escalation. The results show how switching mechanisms affect both the total number of infected individuals and the extinction time, including their dispersion. Since the switching mechanisms are specified rather than estimated from the intervention history, the results are conditional model-based comparisons rather than estimates of the historical effects of interventions in Luxembourg.