2025-06-30 | | Total: 44
While the theory of deep learning has made some progress in recent years, much of it is limited to the ReLU activation function. In particular, while the neural tangent kernel (NTK) and neural network Gaussian process kernel (NNGP) have given theoreticians tractable limiting cases of fully connected neural networks, their properties for most activation functions except for powers of the ReLU function are poorly understood. Our main contribution is to provide a more general characterization of the RKHS of these kernels for typical activation functions whose only non-smoothness is at zero, such as SELU, ELU, or LeakyReLU. Our analysis also covers a broad set of special cases such as missing biases, two-layer networks, or polynomial activations. Our results show that a broad class of not infinitely smooth activations generate equivalent RKHSs at different network depths, while polynomial activations generate non-equivalent RKHSs. Finally, we derive results for the smoothness of NNGP sample paths, characterizing the smoothness of infinitely wide neural networks at initialization.
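For orientation, the NNGP kernel that this abstract refers to can be computed in closed form in the plain ReLU case. The sketch below is not taken from the paper: it uses the standard arc-cosine formula for E[relu(u) relu(v)] under a zero-mean bivariate Gaussian, and the function names, the input-layer scaling by 1/d, and the default variances `sigma_w2=2.0`, `sigma_b2=0.0` are assumptions chosen for illustration.

```python
import math

def relu_expectation(k11, k12, k22):
    """E[relu(u) * relu(v)] for zero-mean Gaussian (u, v) with
    Var(u) = k11, Var(v) = k22, Cov(u, v) = k12 (arc-cosine formula)."""
    norm = math.sqrt(k11 * k22)
    if norm == 0.0:
        return 0.0
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, k12 / norm))
    theta = math.acos(cos_theta)
    return norm * (math.sin(theta) + (math.pi - theta) * cos_theta) / (2.0 * math.pi)

def nngp_relu(x, y, depth, sigma_w2=2.0, sigma_b2=0.0):
    """NNGP kernel value K(x, y) after `depth` ReLU layers.
    Assumed convention: K0(x, y) = sigma_b2 + sigma_w2 * <x, y> / d."""
    d = len(x)
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    kxx = sigma_b2 + sigma_w2 * dot(x, x) / d
    kxy = sigma_b2 + sigma_w2 * dot(x, y) / d
    kyy = sigma_b2 + sigma_w2 * dot(y, y) / d
    for _ in range(depth):
        # All three entries are updated from the previous layer's values.
        kxx, kxy, kyy = (
            sigma_b2 + sigma_w2 * relu_expectation(kxx, kxx, kxx),
            sigma_b2 + sigma_w2 * relu_expectation(kxx, kxy, kyy),
            sigma_b2 + sigma_w2 * relu_expectation(kyy, kyy, kyy),
        )
    return kxy
```

With `sigma_w2=2.0` and no bias, the diagonal K(x, x) is preserved from layer to layer, which makes kernels at different depths easy to compare numerically.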
This paper jointly addresses the challenges of non-stationarity and high dimensionality in analysing multivariate time series. Building on the classical concept of cointegration, we introduce a more flexible notion, called stability space, aimed at capturing stationary components in settings where traditional assumptions may not hold. We examine the parametric Johansen procedure alongside two non-parametric alternatives based on dimensionality reduction techniques: Partial Least Squares and Principal Component Analysis. Additionally, we propose a targeted selection of components that prioritises stationarity. Through simulations and real-data applications, we evaluate the performance of these methodologies across various scenarios, including high-dimensional configurations.
We propose a simple and intuitive test for arguably the most prevalent hypothesis in statistics, namely that data are independent and identically distributed (IID), based on a newly introduced off-diagonal sequential U-process. This IID test is fully nonparametric and applicable to random objects in general spaces, while requiring no specific alternatives such as structural breaks or serial dependence, which allows for detecting general types of violations of the IID assumption. An easy-to-implement jackknife multiplier bootstrap is tailored to produce critical values of the test. Under mild conditions, we establish a Gaussian approximation for the proposed U-process and derive non-asymptotic coupling and Kolmogorov distance bounds for its maximum and its bootstrapped version, providing rigorous theoretical guarantees. Simulations and real data applications are conducted to demonstrate the usefulness and versatility of the test compared with existing methods.
In Major League Baseball, every ballpark is different, with different dimensions and climates. These differences make some ballparks more conducive to hitting home runs than others. Several factors conspire to make estimation of these differences challenging. Home runs are relatively rare, occurring in roughly 3% of plate appearances. The quality of personnel and the frequency of batter-pitcher handedness combinations that appear in the thirty ballparks vary considerably. Because of asymmetries, effects due to ballpark can depend strongly on hitter handedness. We consider generalized linear mixed effects models based on the Poisson distribution for home runs. We use as our observational unit the combination of game and handedness-matchup. Our model allows for four theoretical mean home run frequency functions for each ballpark. We control for variation in personnel across games by constructing "elsewhere" measures of batter ability to hit home runs and pitcher tendency to give them up, using data from parks other than the one in which the response is observed. We analyze 13 seasons of data and find that the estimated home run frequencies adjusted to average personnel are substantially different from observed home run frequencies, leading to considerably different ballpark rankings than often appear in the media.
Identifying frail older adults in an ageing population has become crucial to improve the services offered by a healthcare system. This work aims to develop a composite indicator to assess the level of frailty of individuals using administrative health data. Since frailty has a complex, multidimensional nature, a multi-outcome approach was used. After an extensive literature review, several adverse health events were identified to represent frailty. These adverse events were modelled by logistic classifiers, with frailty determinants (previously selected with gradient tree boosting) used as covariates. The sensitivity and specificity of the individual classifiers are exploited to rewrite their combined likelihood. From this combined likelihood, we obtain an indicator capable of measuring the frailty of the population. The indicator demonstrates strong performance across all outcomes for various years. The main innovation brought by this indicator lies in the possibility of using diverse subgroups of frailty determinants specific to each outcome, without imposing constraints on their structural form. In conclusion, this indicator has proven to be an efficient tool for quantifying the frailty of elderly individuals, providing potential help to health authorities in preventing frailty-related adverse events.
Text watermarks in large language models (LLMs) are an increasingly important tool for detecting synthetic text and distinguishing human-written content from LLM-generated text. While most existing studies focus on determining whether entire texts are watermarked, many real-world scenarios involve mixed-source texts, which blend human-written and watermarked content. In this paper, we address the problem of optimally estimating the watermark proportion in mixed-source texts. We cast this problem as estimating the proportion parameter in a mixture model based on pivotal statistics. First, we show that this parameter is not even identifiable in certain watermarking schemes, let alone consistently estimable. In stark contrast, for watermarking methods that employ continuous pivotal statistics for detection, we demonstrate that the proportion parameter is identifiable under mild conditions. We propose efficient estimators for this class of methods, which include several popular unbiased watermarks as examples, and derive minimax lower bounds for any measurable estimator based on pivotal statistics, showing that our estimators achieve these lower bounds. Through evaluations on both synthetic data and mixed-source text generated by open-source models, we demonstrate that our proposed estimators consistently achieve high estimation accuracy.
Dengue is an infectious disease which poses a significant socioeconomic and disease burden in many tropical and subtropical regions of the world. This work aims to provide additional insight into the association between dengue and climate in the Philippines. We employ a two-stage modelling framework: the first stage fits climate models, while the second stage fits a health model that uses the climate predictions from the first stage as inputs. We postulate a Bayesian spatio-temporal model and use the integrated nested Laplace approximation (INLA) approach for inference. To account for the uncertainty in the climate models, we perform posterior sampling and then perform Bayesian model averaging to compute the final posterior estimates of second-stage model parameters. The results indicate that temperature is positively associated with dengue, although extremely hot conditions tend to have a negative effect. Moreover, the relationship between rainfall and dengue varies in space. In areas with uniform amounts of rainfall all year round, rainfall is negatively associated with dengue. In contrast, in regions with pronounced dry and wet seasons, rainfall shows a positive association with dengue. Finally, there remains unexplained structured variation in space and time after accounting for the impact of climate variables and other covariates.
Power and sample size calculations for Wald tests in generalized linear models (GLMs) are often limited to specific cases like logistic regression. More general methods typically require detailed study parameters that are difficult to obtain during planning. We introduce two new effect size measures for estimating power, sample size, or the minimally detectable effect size in studies using Wald tests across any GLM. These measures accommodate any number of predictors or adjusters and require only basic study information. We provide practical guidance for interpreting and applying these measures to approximate a key parameter in power calculations. We also derive asymptotic bounds on the relative error of these approximations, showing that accuracy depends on features of the GLM such as the nonlinearity of the link function. To complement this analysis, we conduct simulation studies across common model specifications, identifying best use cases and opportunities for improvement. Finally, we test the methods in finite samples to confirm their practical utility.
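As background for the abstract above, the textbook large-sample power calculation for a two-sided Wald test of a single coefficient can be sketched as follows. This is the generic normal approximation, not the new effect size measures the paper introduces; `wald_power`, `required_n`, and the parameter `unit_se` (the standard error the estimate would have at n = 1, assumed to shrink as 1/sqrt(n)) are hypothetical names used only for illustration.

```python
from statistics import NormalDist

def wald_power(beta, se, alpha=0.05):
    """Approximate power of a two-sided level-alpha Wald test of H0: beta = 0,
    assuming the estimator is normal with mean `beta` and standard error `se`."""
    z = NormalDist()
    crit = z.inv_cdf(1.0 - alpha / 2.0)
    delta = beta / se  # noncentrality on the z scale
    # Probability that the z statistic falls in either rejection region.
    return (1.0 - z.cdf(crit - delta)) + z.cdf(-crit - delta)

def required_n(beta, unit_se, power=0.8, alpha=0.05):
    """Smallest n reaching the target power, assuming se = unit_se / sqrt(n)."""
    n = 1
    while wald_power(beta, unit_se / n ** 0.5, alpha) < power:
        n += 1
    return n
```

Under the null (`beta = 0`) the function returns the nominal level `alpha`, which is a quick sanity check on the formula.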
Pharmacokinetic modeling using ordinary differential equations (ODEs) has an important role in dose optimization studies, where dosing must balance sustained therapeutic efficacy with the risk of adverse side effects. Such ODE models characterize drug plasma concentration over time and allow pharmacokinetic parameters to be inferred, such as drug absorption and elimination rates. For time-course studies involving treatment groups with multiple subjects, mixed-effects ODE models are commonly used. However, existing methods tend to lack uncertainty quantification at the subject level, both for key measures such as peak or trough concentration and for predictions of drug concentration. To address such limitations, we propose an extension of manifold-constrained Gaussian processes for inference of general mixed-effects ODE models within a Bayesian statistical framework. We evaluate our method on simulated examples, demonstrating its ability to provide fast and accurate inference for parameters and trajectories using nested optimization. To illustrate the practical efficacy of the proposed method, we provide a real data analysis of a pharmacokinetic model used for an HIV combination therapy study.
We establish bounds on the conductance for the systematic-scan and random-scan Gibbs samplers when the target distribution satisfies a Poincaré or log-Sobolev inequality and possesses sufficiently regular conditional distributions. These bounds lead to mixing time guarantees that extend beyond the log-concave setting, offering new insights into the convergence behavior of Gibbs sampling in broader regimes. Moreover, we demonstrate that our results remain valid for log-Lipschitz and log-smooth target distributions. Our approach relies on novel three-set isoperimetric inequalities and a sequential coupling argument for the Gibbs sampler.
The integration of the history and philosophy of statistics was initiated at least by Hacking (1965) and advanced by Mayo (1996), but it has not received sustained follow-up. Yet such integration is more urgent than ever, as the recent success of artificial intelligence has been driven largely by machine learning -- a field historically developed alongside statistics. Today, the boundary between statistics and machine learning is increasingly blurred. What we now need is integration, twice over: of history and philosophy, and of the field they engage -- statistics and machine learning. I present a case study of a philosophical idea in machine learning (and in formal epistemology) whose root can be traced back to an often under-appreciated insight in Neyman and Pearson's 1936 work (a follow-up to their 1933 classic). This leads to the articulation of a foundational assumption -- largely implicit in, but shared by, the practices of frequentist statistics and machine learning -- which I call achievabilism. Another integration also emerges at the level of methodology, combining two ends of the philosophy of science spectrum: history and philosophy of science on the one hand, and formal epistemology on the other hand.
Single-cell sequencing is revolutionizing biology by enabling detailed investigations of cell-state transitions. Many biological processes unfold along continuous trajectories, yet it remains challenging to extract smooth, low-dimensional representations from inherently noisy, high-dimensional single-cell data. Neighbor embedding (NE) algorithms, such as t-SNE and UMAP, are widely used to embed high-dimensional single-cell data into low dimensions. But they often introduce undesirable distortions, resulting in misleading interpretations. Existing evaluation methods for NE algorithms primarily focus on separating discrete cell types rather than capturing continuous cell-state transitions, while dynamic modeling approaches rely on strong assumptions about cellular processes and specialized data. To address these challenges, we build on the Predictability-Computability-Stability (PCS) framework for reliable and reproducible data-driven discoveries. First, we systematically evaluate popular NE algorithms through empirical analysis, simulation, and theory, and reveal their key shortcomings, such as artifacts and instability. We then introduce NESS, a principled and interpretable machine learning approach to improve NE representations by leveraging algorithmic stability and to enable robust inference of smooth biological structures. NESS offers useful concepts, quantitative stability metrics, and efficient computational workflows to uncover developmental trajectories and cell-state transitions in single-cell data. Finally, we apply NESS to six single-cell datasets, spanning pluripotent stem cell differentiation, organoid development, and multiple tissue-specific lineage trajectories. Across these diverse contexts, NESS consistently yields useful biological insights, such as identification of transitional and stable cell states and quantification of transcriptional dynamics during development.
Physics phenomena are often described by ordinary and/or partial differential equations (ODEs/PDEs), and solved analytically or numerically. Unfortunately, many real-world systems are described only approximately, with missing or unknown terms in the equations. This makes the distribution of the physics model differ from the true data-generating process (DGP). Using limited and unpaired data between DGP observations and the imperfect model simulations, we investigate this particular setting by completing the known-physics model, combining theory-driven and data-driven models to describe the shifted distribution involved in the DGP. We present a novel hybrid generative model approach combining deep grey-box modelling with Optimal Transport (OT) methods to enhance incomplete physics models. Our method implements OT maps in data space while maintaining minimal source distribution distortion, demonstrating superior performance in resolving the unpaired problem and ensuring correct usage of physics parameters. Unlike black-box alternatives, our approach leverages physics-based inductive biases to accurately learn system dynamics while preserving interpretability through its domain knowledge foundation. Experimental results validate our method's effectiveness in both generation tasks and model transparency, offering detailed insights into learned physics dynamics.
Many real-world spatio-temporal processes exhibit nonlinear dynamics that can often be described through stochastic partial differential equations. These models are flexible and scientifically motivated; however, implementing them in a fully Bayesian framework can be computationally challenging. We are motivated by birth rate data, which have important implications for public health and are known to follow nonlinear dynamics. We propose a covariance calibration strategy that specifies the covariance matrix of a linear mixed effects model to be close in Frobenius norm to that of a Generalized Quadratic Nonlinearity (GQN) model. We refer to this as Frobenius norm matching. This allows us to model nonlinear dynamics using an easier-to-implement linear framework. The calibrated linear model is efficiently implemented using Exact Posterior Regression (EPR), a recently proposed Bayesian model that enables sampling of fixed and random effects directly from the posterior distribution. We provide simulation studies that compare to implementations using MCMC. Finally, we use this approach to analyze Florida county-level birth rate data from 1990 to 2023. Our results indicate that our non-linear spatio-temporal model outperforms linear dynamic spatio-temporal models for these data and identifies covariate effects consistent with existing literature, all while avoiding the computational difficulties of MCMC.
These notes describe our experience with running a student seminar on average-case complexity in statistical inference using the jigsaw learning format at ETH Zurich in Fall of 2024. The jigsaw learning technique is an active learning technique where students work in groups on independent parts of the task and then reassemble the groups to combine all the parts together. We implemented this technique for the proofs of various recent research developments, combined with a presentation by one of the students at the beginning of the session. We describe our experience and thoughts on such a format applied in a student research seminar, including, but not limited to, higher engagement, more accessible talks by the students, and increased student participation in discussions. In the Appendix, we include all the exercise sheets for the topic, which may be of independent interest for courses on statistical-to-computational gaps and average-case complexity.
In this paper, we derive closed-form expressions for the bias of estimators of the Theil, Atkinson, and dispersion indices when the underlying population follows a finite mixture of gamma distributions. Our methodology builds on probabilistic techniques grounded in Mosimann's proportion-sum independence theorem and the gamma-Dirichlet connection, enabling analytical tractability in the presence of population heterogeneity. These results extend existing findings for single gamma models.
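To make the notion of estimator bias in the abstract above concrete, the sketch below defines plug-in estimators of the Theil index and of the Atkinson index (with inequality aversion fixed at 1, an assumption) and checks the Theil bias by Monte Carlo. It covers only the simplest single-population case, a unit-rate exponential (gamma with shape 1), not the finite mixtures treated in the paper. The closed form used for comparison, ln(n) - psi(n + 1) with psi(n + 1) = H_n - gamma, follows from the proportion-sum independence argument for a single gamma sample; all function names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329

def theil_index(xs):
    """Plug-in Theil T index: mean of (x/mu) * log(x/mu), with 0*log(0) = 0."""
    mu = sum(xs) / len(xs)
    return sum((x / mu) * math.log(x / mu) for x in xs if x > 0) / len(xs)

def atkinson_index(xs):
    """Plug-in Atkinson index with aversion parameter 1:
    one minus the ratio of geometric mean to arithmetic mean."""
    mu = sum(xs) / len(xs)
    geo = math.exp(sum(math.log(x) for x in xs) / len(xs))
    return 1.0 - geo / mu

def exact_theil_bias_exponential(n):
    """Bias of the plug-in Theil estimator for n i.i.d. Exp(1) draws:
    ln(n) - psi(n + 1), where psi(n + 1) = H_n - EULER_GAMMA."""
    harmonic = sum(1.0 / i for i in range(1, n + 1))
    return math.log(n) - (harmonic - EULER_GAMMA)

def mc_theil_bias_exponential(n, reps=20000, seed=0):
    """Monte Carlo estimate of the same bias (population Theil = 1 - gamma)."""
    rng = random.Random(seed)
    true_value = 1.0 - EULER_GAMMA
    total = 0.0
    for _ in range(reps):
        total += theil_index([rng.expovariate(1.0) for _ in range(n)])
    return total / reps - true_value
```

Comparing `mc_theil_bias_exponential(n)` with `exact_theil_bias_exponential(n)` for small `n` shows the downward bias that closed-form expressions of this kind quantify.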
Network meta-analysis (NMA) synthesizes evidence for multiple treatments, but decisions on node formation can have important statistical implications including bias or inflated uncertainty. Existing data-driven methods often lack flexibility or fail to fully account for node uncertainty and adjust for between-trial heterogeneity simultaneously. We introduce a Bayesian non-parametric framework using a Dirichlet process prior with a regularized horseshoe base measure. This data-driven approach allows treatments to cluster based on their effects while formally propagating uncertainty about the clustering structure itself. We extend this method to incorporate baseline risk meta-regression, enabling clustering even under heterogeneity, and demonstrate implementation using standard MCMC software. We apply the method to case studies in rheumatology and pain and find adjusting for baseline risk heterogeneity can substantially change which treatments are clustered together, highlighting the importance of methods to allow for meta-regression.
Billera-Holmes-Vogtmann (BHV) tree space is a geodesic metric space of edge-weighted phylogenetic trees with a fixed leaf set. Constructing parametric distributions on this space is challenging due to its non-Euclidean geometry and the intractability of normalizing constants. We address this by fitting Brownian motion transition kernels to tree-valued data via a non-Euclidean bridge construction. Each kernel is determined by a source tree x_0 (the Brownian motion's starting point) and a dispersion parameter t_0 (its duration). Observed trees are modelled as independent draws from the transition kernel defined by (x_0, t_0), analogous to a Gaussian model in Euclidean space. Brownian motion is approximated by an m-step random walk, with the parameter space augmented to include full sample paths. We develop a bridge algorithm to sample paths conditional on their endpoints, and introduce methods for sampling a Bayesian posterior for (x_0, t_0) and for marginal likelihood evaluation. This enables hypothesis testing for alternative source trees. The approach is validated on simulated data and applied to an experimental data set of yeast gene trees. These methods provide a foundation for future development of a wider class of probabilistic models of tree-valued data.
Nonlinear dimension reduction (NLDR) techniques such as tSNE and UMAP provide a low-dimensional representation of high-dimensional data (p-D) by applying a nonlinear transformation. NLDR often exaggerates random patterns. But NLDR views have an important role in data analysis because, if done well, they provide a concise visual (and conceptual) summary of p-D distributions. The NLDR methods and hyper-parameter choices can create wildly different representations, making it difficult to decide which is best, or whether any or all are accurate or misleading. To help assess the NLDR and decide on which, if any, is the most reasonable representation of the structure(s) present in the p-D data, we have developed an algorithm to show the 2-D NLDR model in the p-D space, viewed with a tour, a movie of linear projections. From this, one can see if the model fits everywhere, or better in some subspaces, or completely mismatches the data. Also, we can see how different methods may have similar summaries or quirks.
We provide a characterization of the continuous positive definite kernels on R^d that are invariant to linear isometries, i.e. invariant under the orthogonal group O(d). Furthermore, we provide necessary and sufficient conditions for these kernels to be strictly positive definite. This class of isotropic kernels is fairly general: first, it unifies stationary isotropic and dot product kernels, and second, it includes neural network kernels that arise from infinite-width limits of neural networks.
With the rise of the network perspective, researchers have made numerous important discoveries over the past decade by constructing psychological networks. Unfortunately, most of these networks are based on cross-sectional data, which can only reveal associations between variables but not their directional or causal relationships. Recently, the nodeIdentifyR algorithm (NIRA) has provided a promising method for simulating causal processes based on cross-sectional network structures. However, this algorithm is not capable of handling cross-sectional nested data, which greatly limits its applicability. In response to this limitation, the present study proposes a multilevel extension of NIRA, referred to as multilevel NIRA. We provide a detailed explanation of the algorithm's core principles and modeling procedures. Finally, we discuss the potential applications and practical implications of this approach, as well as its limitations and directions for future research.
Drought is a significant natural phenomenon with profound environmental, economic, and societal impacts. Effective monitoring of drought characteristics -- such as intensity, magnitude, and duration -- is crucial for resilience and mitigation strategies. This study proposes the use of non-parametric Time Between Events and Amplitude (TBEA) control charts for detecting changes in drought characteristics, specifically applying them to the Standardized Precipitation and Evapotranspiration Index. Without claiming to be exhaustive, we considered two non-parametric change-point control charts based on the Mann-Whitney and Kolmogorov-Smirnov statistics, respectively. We studied the in-control statistical performance of the change-point control charts in the time between events and amplitude framework through a simulation study. Furthermore, we assessed the coherence of the results obtained with a distribution-free upper-sided Exponentially Weighted Moving Average control chart specifically designed for monitoring TBEA data. The findings suggest that the proposed methods may serve as valuable tools for climate resilience planning and water resource management.
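The upper-sided EWMA idea mentioned in the abstract above can be illustrated with a generic recursion. The sketch below is a plain upper-sided EWMA with a reflecting barrier at the target, not the distribution-free TBEA chart used in the study; the function name, the smoothing constant `lam`, and the control `limit` are illustrative assumptions.

```python
def upper_ewma_signals(xs, target, lam=0.2, limit=1.0):
    """Return the indices at which an upper-sided EWMA exceeds `limit`.

    Recursion: z_t = max(target, lam * x_t + (1 - lam) * z_{t-1}), z_0 = target.
    The max() keeps the statistic from drifting below the target, so only
    upward shifts accumulate evidence.
    """
    z = target
    signals = []
    for t, x in enumerate(xs):
        z = max(target, lam * x + (1.0 - lam) * z)
        if z > limit:
            signals.append(t)
    return signals

# Example: a level shift from 0 to 5 starting at the fourth observation.
shifted = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0]
alarm_times = upper_ewma_signals(shifted, target=0.0, lam=0.5, limit=3.0)
```

A smaller `lam` averages over more past observations, which makes the chart slower to react but more sensitive to small sustained shifts.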
Selecting prior distributions in Bayesian statistics is challenging, resource-intensive, and subjective. We investigate the use of large language models (LLMs) to suggest suitable, knowledge-based informative priors. We developed an extensive prompt asking LLMs not only to suggest priors but also to verify and reflect on their choices. We evaluated Claude Opus, Gemini 2.5 Pro, and ChatGPT-4o-mini on two real datasets: heart disease risk and concrete strength. All LLMs correctly identified the direction for all associations (e.g., that heart disease risk is higher for males). The quality of suggested priors was measured by their Kullback-Leibler divergence from the maximum likelihood estimator's distribution. The LLMs suggested both moderately and weakly informative priors. The moderate priors were often overconfident, resulting in distributions misaligned with the data. In our experiments, Claude and Gemini provided better priors than ChatGPT. For weakly informative priors, a key performance difference emerged: ChatGPT and Gemini defaulted to an "unnecessarily vague" mean of 0, while Claude did not, demonstrating a significant advantage. The ability of LLMs to identify correct associations shows their great potential as an efficient, objective method for developing informative priors. However, the primary challenge remains in calibrating the width of these priors to avoid over- and under-confidence.
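The Kullback-Leibler criterion mentioned in the abstract above has a closed form when both distributions are univariate normal. The sketch below assumes a normal prior and a normal approximation to the estimator's sampling distribution; the direction of the divergence (reference distribution first, prior second) and the numbers in the example are assumptions for illustration, not values from the study.

```python
import math

def kl_normal(mu_p, sd_p, mu_q, sd_q):
    """KL(P || Q) for P = N(mu_p, sd_p^2) and Q = N(mu_q, sd_q^2)."""
    return (
        math.log(sd_q / sd_p)
        + (sd_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sd_q ** 2)
        - 0.5
    )

# Hypothetical reference: estimator distributed as N(0.8, 0.2^2).
reference = (0.8, 0.2)
# An overconfident prior (narrow, centred away from the estimate) versus a
# wider prior centred at the same place.
kl_overconfident = kl_normal(*reference, 0.3, 0.05)
kl_wider = kl_normal(*reference, 0.3, 0.5)
```

In this toy comparison the narrow, misplaced prior is penalised far more heavily than the wider one, which is the over-confidence effect the abstract describes.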
Flexible modelling of the autocovariance function (ACF) is central to time-series, spatial, and spatio-temporal analysis. Modern applications often demand flexibility beyond classical parametric models, motivating non-parametric descriptions of the ACF. Bochner's Theorem guarantees that any positive spectral measure yields a valid ACF via the inverse Fourier transform; however, existing non-parametric approaches in the spectral domain rarely return closed-form expressions for the ACF itself. We develop a flexible, closed-form class of non-parametric ACFs by deriving the inverse Fourier transform of B-spline spectral bases with arbitrary degree and knot placement. This yields a general class of ACFs with three key features: (i) it is provably dense, under an L^1 metric, in the space of weakly stationary, mean-square continuous ACFs with mild regularity conditions; (ii) it accommodates univariate, multivariate, and multidimensional processes; and (iii) it naturally supports non-separable structure without requiring explicit imposition. Jackson-type approximation bounds establish convergence rates, and empirical results on simulated and real-world data demonstrate accurate process recovery. The method provides a practical and theoretically grounded approach for constructing a non-parametric class of ACFs.
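The simplest instance of the construction described above is a single degree-0 B-spline (a box function) used as a spectral density, symmetric about zero. Its inverse Fourier transform is a scaled sinc function, which the sketch below checks against numerical integration. The convention C(tau) = integral of cos(omega * tau) f(omega) d omega, the function names, and the parameters are assumptions for illustration; the paper's general class covers arbitrary degree and knot placement.

```python
import math

def acf_box_spectrum(tau, half_width, variance=1.0):
    """Closed-form ACF for a spectral density equal to variance / (2 * half_width)
    on [-half_width, half_width] and zero elsewhere."""
    if tau == 0.0:
        return variance
    return variance * math.sin(half_width * tau) / (half_width * tau)

def acf_numeric(tau, half_width, variance=1.0, steps=20000):
    """Midpoint-rule approximation of the same inverse Fourier transform."""
    h = 2.0 * half_width / steps
    density = variance / (2.0 * half_width)
    total = 0.0
    for i in range(steps):
        omega = -half_width + (i + 0.5) * h
        total += math.cos(omega * tau) * density
    return total * h
```

Because the spectral density is non-negative, Bochner's theorem guarantees that the resulting sinc-type function is a valid ACF, even though it takes negative values at some lags.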
Despite a global decline in motor vehicle crash fatalities due to improved research and road safety policies, road traffic injuries remain a significant public health concern. The World Health Organization 2023 report highlights that road traffic injuries are the leading cause of death among individuals aged 5-29, with over half of fatalities involving pedestrians, cyclists, and motorcyclists. This study addresses this critical issue by identifying high-risk areas in Montgomery County, Maryland, contributing to the global goal of halving road traffic deaths and injuries by 2030. Using Kernel Density Estimation (KDE) and spatial autocorrelation analysis, we estimate collision densities and identify hotspots for targeted interventions. Our findings reveal significant spatial clustering of traffic collisions, with distinct patterns in densely populated urban areas and rural regions, offering valuable insights for policymakers to enhance road safety.
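Kernel density estimation of point locations, as used in the abstract above, can be sketched in a few lines. The version below uses an isotropic Gaussian kernel with a single bandwidth in planar coordinates; the kernel choice, the bandwidth, and the toy coordinates are assumptions, since the abstract does not specify them.

```python
import math

def gaussian_kde_2d(points, bandwidth):
    """Return a function estimating the density at (x, y) from 2-D points,
    using an isotropic Gaussian kernel with the given bandwidth."""
    n = len(points)
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * n)

    def density(x, y):
        total = 0.0
        for px, py in points:
            sq_dist = (x - px) ** 2 + (y - py) ** 2
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        return norm * total

    return density

# Toy collision locations: a tight cluster near the origin plus one outlier.
collisions = [(0.0, 0.0), (0.1, -0.1), (-0.1, 0.1), (0.05, 0.05), (5.0, 5.0)]
density = gaussian_kde_2d(collisions, bandwidth=0.5)
hotspot_value = density(0.0, 0.0)
background_value = density(2.5, 2.5)
```

Locations where the estimated density is high relative to the surrounding area are candidate hotspots; the bandwidth controls how local that comparison is.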