2025-03-28 | | Total: 57
The fact that redundant information does not change a rational belief after Bayesian updating implies uniqueness of Bayes rule. In fact, any updating rule is uniquely specified by this principle. This is true for the classical setting, as well as settings with improper or continuous priors. We prove this result and illustrate it with two examples.
This paper proposes a nonparametric test of independence of one random variable from a large pool of other random variables. The test statistic is the maximum of several Chatterjee's rank correlations and critical values are computed via a block multiplier bootstrap. The test is shown to asymptotically control size uniformly over a large class of data-generating processes, even when the number of variables is much larger than sample size. The test is consistent against any fixed alternative. It can be combined with a stepwise procedure for selecting those variables from the pool that violate independence, while controlling the family-wise error rate. All formal results leave the dependence among variables in the pool completely unrestricted. In simulations, we find that our test is very powerful, outperforming existing tests in most scenarios considered, particularly in high dimensions and/or when the variables in the pool are dependent.
The presence or absence of winner-loser effects is a widely discussed phenomenon across both sports and psychology research. Investigation of such effects is often hampered by the limited availability of data. Online chess has exploded in popularity in recent years and provides vast amounts of data which can be used to explore this question. With a hierarchical Bayesian regression model, we carefully investigate the presence of such experiential effects in online chess. Using a large quantity of online chess data, we see little evidence for experiential effects that are consistent across all players, with some individual players showing some evidence for such effects. Given the challenging temporal nature of this data, we discuss several methods for assessing the suitability of our model and carefully check its validity.
Decision making under uncertainty is a cross-cutting challenge in science and engineering. Most approaches to this challenge employ probabilistic representations of uncertainty. In complicated systems accessible only via data or black-box models, however, these representations are rarely known. We discuss how to characterize and manipulate such representations using triangular transport maps, which approximate any complex probability distribution as a transformation of a simple, well-understood distribution. The particular structure of triangular transport guarantees many desirable mathematical and computational properties that translate well into solving practical problems. Triangular maps are actively used for density estimation, (conditional) generative modelling, Bayesian inference, data assimilation, optimal experimental design, and related tasks. While there is ample literature on the development and theory of triangular transport methods, this manuscript provides a detailed introduction for scientists interested in employing measure transport without assuming a formal mathematical background. We build intuition for the key foundations of triangular transport, discuss many aspects of its practical implementation, and outline the frontiers of this field.
Most Kalman filters for non-linear systems, such as the unscented Kalman filter, are based on Gaussian approximations. We use Poincaré inequalities to bound the Wasserstein distance between the true joint distribution of the prediction and measurement and its Gaussian approximation. The bounds can be used to assess the performance of non-linear Gaussian filters and determine those filtering approximations that are most likely to induce error.
We revisit the discrete argmin inference problem in high-dimensional settings. Given n observations from a d dimensional vector, the goal is to test whether the rth component of the mean vector is the smallest among all components. We propose dimension-agnostic tests that maintain validity regardless of how d scales with n, and regardless of arbitrary ties in the mean vector. Notably, our validity holds under mild moment conditions, requiring little more than finiteness of a second moment, and permitting possibly strong dependence between coordinates. In addition, we establish the local minimax separation rate for this problem, which adapts to the cardinality of a confusion set, and show that the proposed tests attain this rate. Our method uses the sample splitting and self-normalization approach of Kim and Ramdas (2024). Our tests can be easily inverted to yield confidence sets for the argmin index. Empirical results illustrate the strong performance of our approach in terms of type I error control and power compared to existing methods.
Identifying low-dimensional latent structures within high-dimensional data has long been a central topic in the machine learning community, driven by the need for data compression, storage, transmission, and deeper data understanding. Traditional methods, such as principal component analysis (PCA) and autoencoders (AE), operate in an unsupervised manner, ignoring label information even when it is available. In this work, we introduce a unified method capable of learning latent spaces in both unsupervised and supervised settings. We formulate the problem as a nonlinear multiple-response regression within an index model context. By applying the generalized Stein's lemma, the latent space can be estimated without knowing the nonlinear link functions. Our method can be viewed as a nonlinear generalization of PCA. Moreover, unlike AE and other neural network methods that operate as "black boxes", our approach not only offers better interpretability but also reduces computational complexity while providing strong theoretical guarantees. Comprehensive numerical experiments and real data analyses demonstrate the superior performance of our method.
High-dimensional functional time series (HDFTS) are often characterized by nonlinear trends and high spatial dimensions. Such data poses unique challenges for modeling and forecasting due to the nonlinearity, nonstationarity, and high dimensionality. We propose a novel probabilistic functional neural network (ProFnet) to address these challenges. ProFnet integrates the strengths of feedforward and deep neural networks with probabilistic modeling. The model generates probabilistic forecasts using Monte Carlo sampling and also enables the quantification of uncertainty in predictions. While capturing both temporal and spatial dependencies across multiple regions, ProFnet offers a scalable and unified solution for large datasets. Applications to Japan's mortality rates demonstrate superior performance. This approach enhances predictive accuracy and provides interpretable uncertainty estimates, making it a valuable tool for forecasting complex high-dimensional functional data and HDFTS.
We introduce a novel approach to compositional data analysis based on L∞-normalization, addressing challenges posed by zero-rich high-throughput data. Traditional methods like Aitchison's transformations require excluding zeros, conflicting with the reality that omics datasets contain structural zeros that cannot be removed without violating inherent biological structures. Such datasets exist exclusively on the boundary of compositional space, making interior-focused approaches fundamentally misaligned. We present a family of Lp-normalizations, focusing on L∞-normalization due to its advantageous properties. This approach identifies compositional space with the L∞-simplex, represented as a union of top-dimensional faces called L∞-cells. Each cell consists of samples where one component's absolute abundance equals or exceeds all others, with a coordinate system identifying it with a d-dimensional unit cube. When applied to vaginal microbiome data, L∞-decomposition aligns with established Community State Types while offering advantages: each L∞-CST is named after its dominating component, has clear biological meaning, remains stable under sample changes, resolves cluster-based issues, and provides a coordinate system for exploring internal structure. We extend homogeneous coordinates through cube embedding, mapping data into a d-dimensional unit cube. These embeddings can be integrated via Cartesian product, providing unified representations from multiple perspectives. While demonstrated through microbiome studies, these methods apply to any compositional data.
Analysis of panel count data has garnered a considerable amount of attention in the literature, leading to the development of multiple statistical techniques. In inferential analysis, most of the works focus on leveraging estimating equations-based techniques or conventional maximum likelihood estimation. However, the robustness of these methods is largely questionable. In this paper, we present the robust density power divergence estimation for panel count data arising from nonhomogeneous Poisson processes, correlated through a latent frailty variable. In order to cope with real-world incidents, it is often desired to impose certain inequality constraints on the parameter space, giving rise to the restricted minimum density power divergence estimator. The significant contribution of this study lies in deriving its asymptotic properties. The proposed method ensures high efficiency in the model estimation while providing reliable inference despite data contamination. Moreover, the density power divergence measure is governed by a tuning parameter γ, which controls the trade-off between robustness and efficiency. To effectively determine the optimal value of γ, this study employs a generalized score-matching technique, marking considerable progress in the data analysis. Simulation studies and real data examples are provided to illustrate the performance of the estimator and to substantiate the theory developed.
Differential privacy (DP) is becoming increasingly important for deployed machine learning applications because it provides strong guarantees for protecting the privacy of individuals whose data is used to train models. However, DP mechanisms commonly used in machine learning tend to struggle on many real world distributions, including highly imbalanced or small labeled training sets. In this work, we propose a new scalable DP mechanism for deep learning models, SWAG-PPM, by using a pseudo posterior distribution that downweights by-record likelihood contributions proportionally to their disclosure risks as the randomized mechanism. As a motivating example from official statistics, we demonstrate SWAG-PPM on a workplace injury text classification task using a highly imbalanced public dataset published by the U.S. Occupational Safety and Health Administration (OSHA). We find that SWAG-PPM exhibits only modest utility degradation against a non-private comparator while greatly outperforming the industry standard DP-SGD for a similar privacy budget.
In this paper we consider the use of tiered background knowledge within constraint based causal discovery. Our focus is on settings relaxing causal sufficiency, i.e. allowing for latent variables which may arise because relevant information could not be measured at all, or not jointly, as in the case of multiple overlapping datasets. We first present novel insights into the properties of the 'tiered FCI' (tFCI) algorithm. Building on this, we introduce a new extension of the IOD (integrating overlapping datasets) algorithm incorporating tiered background knowledge, the 'tiered IOD' (tIOD) algorithm. We show that under full usage of the tiered background knowledge tFCI and tIOD are sound, while simple versions of the tIOD and tFCI are sound and complete. We further show that the tIOD algorithm can often be expected to be considerably more efficient and informative than the IOD algorithm even beyond the obvious restriction of the Markov equivalence classes. We provide a formal result on the conditions for this gain in efficiency and informativeness. Our results are accompanied by a series of examples illustrating the exact role and usefulness of tiered background knowledge.
We assess the value of calibrating forecast models for significant wave height Hs, wind speed W and mean spectral wave period Tm for forecast horizons between zero and 168 hours from a commercial forecast provider, to improve forecast performance for a location in the central North Sea. We consider two straightforward calibration models, linear regression (LR) and non-homogeneous Gaussian regression (NHGR), incorporating deterministic, control and ensemble mean forecast covariates. We show that relatively simple calibration models (with at most three covariates) provide good calibration and that addition of further covariates cannot be justified. Optimal calibration models (for the forecast mean of a physical quantity) always make use of the deterministic forecast and ensemble mean forecast for the same quantity, together with a covariate associated with a different physical quantity. The selection of optimal covariates is performed independently per forecast horizon, and the set of optimal covariates shows a large degree of consistency across forecast horizons. As a result, it is possible to specify a consistent model to calibrate a given physical quantity, incorporating a common set of three covariates for all horizons. For NHGR models of a given physical quantity, the ensemble forecast standard deviation for that quantity is skilful in predicting forecast error standard deviation, strikingly so for Hs. We show that the consistent LR and NHGR calibration models facilitate reduction in forecast bias to near zero for all of Hs, W and Tm, and that there is little difference between LR and NHGR calibration for the mean. Both LR and NHGR models facilitate reduction in forecast error standard deviation relative to naive adoption of the (uncalibrated) deterministic forecast, with NHGR providing somewhat better performance.
Recently introduced prior-encoding deep generative models (e.g., PriorVAE, πVAE, and PriorCVAE) have emerged as powerful tools for scalable Bayesian inference by emulating complex stochastic processes like Gaussian processes (GPs). However, these methods remain largely a proof-of-concept and inaccessible to practitioners. We propose DeepRV, a lightweight, decoder-only approach that accelerates training, and enhances real-world applicability in comparison to current VAE-based prior encoding approaches. Leveraging probabilistic programming frameworks (e.g., NumPyro) for inference, DeepRV achieves significant speedups while also improving the quality of parameter inference, closely matching full MCMC sampling. We showcase its effectiveness in process emulation and spatial analysis of the UK using simulated data, gender-wise cancer mortality rates for individuals under 50, and HIV prevalence in Zimbabwe. To bridge the gap between theory and practice, we provide a user-friendly API, enabling scalable and efficient Bayesian inference.
Cardiac real-time magnetic resonance imaging (MRI) is an emerging technology that images the heart at up to 50 frames per second, offering insight into the respiratory effects on the heartbeat. However, this method significantly increases the number of images that must be segmented to derive critical health indicators. Although neural networks perform well on inner slices, predictions on outer slices are often unreliable. This work proposes sparse Bayesian learning (SBL) to predict the ventricular volume on outer slices with minimal manual labeling to address this challenge. The ventricular volume over time is assumed to be dominated by sparse frequencies corresponding to the heart and respiratory rates. Moreover, SBL identifies these sparse frequencies on well-segmented inner slices by optimizing hyperparameters via type -II likelihood, automatically pruning irrelevant components. The identified sparse frequencies guide the selection of outer slice images for labeling, minimizing posterior variance. This work provides performance guarantees for the greedy algorithm. Testing on patient data demonstrates that only a few labeled images are necessary for accurate volume prediction. The labeling procedure effectively avoids selecting inefficient images. Furthermore, the Bayesian approach provides uncertainty estimates, highlighting unreliable predictions (e.g., when choosing suboptimal labels).
Environmental mixture approaches do not accommodate compositional outcomes, consisting of vectors constrained onto the unit simplex. This limitation poses challenges in effectively evaluating the associations between multiple concurrent environmental exposures and their respective impacts on this type of outcomes. As a result, there is a pressing need for the development of analytical methods that can more accurately assess the complexity of these relationships. Here, we extend the Bayesian weighted quantile sum regression (BWQS) framework for jointly modeling compositional outcomes and environmental mixtures using a Dirichlet distribution with a multinomial logit link function. The proposed approach, named Dirichlet-BWQS (DBWQS), allows for the simultaneous estimation of mixture weights associated with each exposure mixture component as well as the association between the overall exposure mixture index and each outcome proportion. We assess the performance of DBWQS regression on extensive simulated data and a real scenario where we investigate the associations between environmental chemical mixtures and DNA methylation-derived placental cell composition, using publicly available data (GSE75248). We also compare our findings with results considering environmental mixtures and each outcome component. Finally, we developed an R package xbwqs where we made our proposed method publicly available (https://github.com/hasdk/xbwqs).
There is increasing interest in flexible parametric models for the analysis of time-to-event data, yet Bayesian approaches that offer incorporation of prior knowledge remain underused. A flexible Bayesian parametric model has recently been proposed that uses M-splines to model the hazard function. We conducted a simulation study to assess the statistical performance of this model, which is implemented in the survextrap R package. Our simulation uses data generating mechanisms of realistic survival data based on two oncology clinical trials. Statistical performance is compared across a range of flexible models, varying the M-spline specification, smoothing procedure, priors, and other computational settings. We demonstrate good performance across realistic scenarios, including good fit of complex baseline hazard functions and time-dependent covariate effects. This work helps inform key considerations to guide model selection, as well as identifying appropriate default model settings in the software that should perform well in a broad range of applications.
We consider the problem of estimating states and parameters in a model based on a system of coupled stochastic differential equations, based on noisy discrete-time data. Special attention is given to nonlinear dynamics and state-dependent diffusivity, where transition densities are not available in closed form. Our technique adds states between times of observations, approximates transition densities using, e.g., the Euler-Maruyama method and eliminates unobserved states using the Laplace approximation. Using case studies, we demonstrate that transition probabilities are well approximated, and that inference is computationally feasible. We discuss limitations and potential extensions of the method.
Particle filters (PFs) is a class of Monte Carlo algorithms that propagate over time a set of N∈N particles which can be used to estimate, in an online fashion, the sequence of filtering distributions (ˆηt)t≥1 defined by a state-space model. Despite the popularity of PFs, the time evolution of their estimates does not appear to have been previously studied in the literature. Denoting by (ˆηNt)t≥1 the PF estimate of (ˆηt)t≥1 and letting κ∈(0,1), we first show that for any number of particles N it holds that, with probability one, we have ‖ for infinitely many t\geq 1, with \|\cdot\| a measure of distance between probability distributions. Considering a simple filtering problem we then provide reassuring results concerning the ability of PFs to estimate jointly a finite set \{\hat{\eta}_t\}_{t=1}^T of filtering distributions by studying \P(\sup_{t\in\{1,\dots,T\}}\|\hat{\eta}_t^{N}-\hat{\eta}_t\|\geq \kappa). Finally, on the same toy filtering problem, we prove that sequential quasi-Monte Carlo, a randomized quasi-Monte Carlo version of PF algorithms, offers greater safety guarantees than PFs in the sense that, for this algorithm, it holds that \lim_{N\rightarrow\infty}\sup_{t\geq 1}\|\hat{\eta}_t^N-\hat{\eta}_t\|=0 with probability one.
In a context of constant increase in competition and heightened regulatory pressure, accuracy, actuarial precision, as well as transparency and understanding of the tariff, are key issues in non-life insurance. Traditionally used generalized linear models (GLM) result in a multiplicative tariff that favors interpretability. With the rapid development of machine learning and deep learning techniques, actuaries and the rest of the insurance industry have adopted these techniques widely. However, there is a need to associate them with interpretability techniques. In this paper, our study focuses on introducing an Explainable Boosting Machine (EBM) model that combines intrinsically interpretable characteristics and high prediction performance. This approach is described as a glass-box model and relies on the use of a Generalized Additive Model (GAM) and a cyclic gradient boosting algorithm. It accounts for univariate and pairwise interaction effects between features and provides naturally explanations on them. We implement this approach on car insurance frequency and severity data and extensively compare the performance of this approach with classical competitors: a GLM, a GAM, a CART model and an Extreme Gradient Boosting (XGB) algorithm. Finally, we examine the interpretability of these models to capture the main determinants of claim costs.
In this paper, we analyze the relative errors in various reliability measures due to the tacit assumption that the components associated with a n-component series system or a parallel system are independently working where the components are dependent. We use Copula functions in said error analysis. This technique generalizes the existing work on error assessment for many wide class of distributions.
In this paper, we analyze the relative errors that crop up in the various reliability measures due to the tacit assumption that the components are independently working associated with a n-component series system or a parallel system where the components are dependent and follow a well-defined multivariate Weibull or exponential distribution. We also list some important observations which the previous authors have not noted in their earlier works. In this paper, we focus on the incurred error in multi-component series and parallel systems having multivariate Weibull distributions. In the upcoming sections, we establish that the present study has relevance with stochastic orders and statistical dependence which were not previously pointed out by previous authors.
Variable selection comprises an important step in many modern statistical inference procedures. In the regression setting, when estimators cannot shrink irrelevant signals to zero, covariates without relationships to the response often manifest small but non-zero regression coefficients. The ad hoc procedure of discarding variables whose coefficients are smaller than some threshold is often employed in practice. We formally analyze a version of such thresholding procedures and develop a simple thresholding method that consistently estimates the set of relevant variables under mild regularity assumptions. Using this thresholding procedure, we propose a sparse, \sqrt{n}-consistent and asymptotically normal estimator whose non-zero elements do not exhibit shrinkage. The performance and applicability of our approach are examined via numerical studies of simulated and real data.
We introduce squared families, which are families of probability densities obtained by squaring a linear transformation of a statistic. Squared families are singular, however their singularity can easily be handled so that they form regular models. After handling the singularity, squared families possess many convenient properties. Their Fisher information is a conformal transformation of the Hessian metric induced from a Bregman generator. The Bregman generator is the normalising constant, and yields a statistical divergence on the family. The normalising constant admits a helpful parameter-integral factorisation, meaning that only one parameter-independent integral needs to be computed for all normalising constants in the family, unlike in exponential families. Finally, the squared family kernel is the only integral that needs to be computed for the Fisher information, statistical divergence and normalising constant. We then describe how squared families are special in the broader class of g-families, which are obtained by applying a sufficiently regular function g to a linear transformation of a statistic. After removing special singularities, positively homogeneous families and exponential families are the only g-families for which the Fisher information is a conformal transformation of the Hessian metric, where the generator depends on the parameter only through the normalising constant. Even-order monomial families also admit parameter-integral factorisations, unlike exponential families. We study parameter estimation and density estimation in squared families, in the well-specified and misspecified settings. We use a universal approximation property to show that squared families can learn sufficiently well-behaved target densities at a rate of \mathcal{O}(N^{-1/2})+C n^{-1/4}, where N is the number of datapoints, n is the number of parameters, and C is some constant.
Proteins are essential components of all living organisms and play a critical role in cellular survival. They have a broad range of applications, from clinical treatments to material engineering. This versatility has spurred the development of protein design, with amino acid sequence design being a crucial step in the process. Recent advancements in deep generative models have shown promise for protein sequence design. However, the scarcity of functional protein sequence data for certain types can hinder the training of these models, which often require large datasets. To address this challenge, we propose a hierarchical model named ProteinRG that can generate functional protein sequences using relatively small datasets. ProteinRG begins by generating a representation of a protein sequence, leveraging existing large protein sequence models, before producing a functional protein sequence. We have tested our model on various functional protein sequences and evaluated the results from three perspectives: multiple sequence alignment, t-SNE distribution analysis, and 3D structure prediction. The findings indicate that our generated protein sequences maintain both similarity to the original sequences and consistency with the desired functions. Moreover, our model demonstrates superior performance compared to other generative models for protein sequence generation.