2024-08-15 | | Total: 31

By representing any constraint-based causal learning algorithm via a placeholder property, we decompose the correctness condition into a part relating the distribution and the true causal graph, and a part that depends solely on the distribution. This provides a general framework to obtain correctness conditions for causal learning, and has the following implications. We provide exact correctness conditions for the PC algorithm, which are then related to correctness conditions of some other existing causal discovery algorithms. We show that the sparsest Markov representation condition is the weakest correctness condition resulting from existing notions of minimality for maximal ancestral graphs and directed acyclic graphs. We also reason that additional knowledge than just Pearl-minimality is necessary for causal learning beyond faithfulness.

Conventional off-policy reinforcement learning (RL) focuses on maximizing the expected return of scalar rewards. Distributional RL (DRL), in contrast, studies the distribution of returns with the distributional Bellman operator in a Euclidean space, leading to highly flexible choices for utility. This paper establishes robust theoretical foundations for DRL. We prove the contraction property of the Bellman operator even when the reward space is an infinite-dimensional separable Banach space. Furthermore, we demonstrate that the behavior of high- or infinite-dimensional returns can be effectively approximated using a lower-dimensional Euclidean space. Leveraging these theoretical insights, we propose a novel DRL algorithm that tackles problems which have been previously intractable using conventional reinforcement learning approaches.

We revisit the fundamental issue of tail behavior of accumulated random realizations when individual realizations are independent, and we develop new sharper bounds on the tail probability and expected linear loss. The underlying distribution is semi-parametric in the sense that it remains unrestricted other than the assumed mean and variance. Our sharp bounds complement well-established results in the literature, including those based on aggregation, which often fail to take full account of independence and use less elegant proofs. New insights include a proof that in the non-identical case, the distributions attaining the bounds have the equal range property, and that the impact of each random variable on the expected value of the sum can be isolated using an extension of the Korkine identity. We show that the new bounds not only complement the extant results but also open up abundant practical applications, including improved pricing of product bundles, more precise option pricing, more efficient insurance design, and better inventory management.

Understanding how the collective activity of neural populations relates to computation and ultimately behavior is a key goal in neuroscience. To this end, statistical methods which describe high-dimensional neural time series in terms of low-dimensional latent dynamics have played a fundamental role in characterizing neural systems. Yet, what constitutes a successful method involves two opposing criteria: (1) methods should be expressive enough to capture complex nonlinear dynamics, and (2) they should maintain a notion of interpretability often only warranted by simpler linear models. In this paper, we develop an approach that balances these two objectives: the Gaussian Process Switching Linear Dynamical System (gpSLDS). Our method builds on previous work modeling the latent state evolution via a stochastic differential equation whose nonlinear dynamics are described by a Gaussian process (GP-SDEs). We propose a novel kernel function which enforces smoothly interpolated locally linear dynamics, and therefore expresses flexible -- yet interpretable -- dynamics akin to those of recurrent switching linear dynamical systems (rSLDS). Our approach resolves key limitations of the rSLDS such as artifactual oscillations in dynamics near discrete state boundaries, while also providing posterior uncertainty estimates of the dynamics. To fit our models, we leverage a modified learning objective which improves the estimation accuracy of kernel hyperparameters compared to previous GP-SDE fitting approaches. We apply our method to synthetic data and data recorded in two neuroscience experiments and demonstrate favorable performance in comparison to the rSLDS.

There is growing interest in using safety analytics and machine learning to support the prevention of workplace incidents, especially in high-risk industries like construction and trucking. Although existing safety analytics studies have made remarkable progress, they suffer from imbalanced datasets, a common problem in safety analytics, resulting in prediction inaccuracies. This can lead to management problems, e.g., incorrect resource allocation and improper interventions. To overcome the imbalanced data problem, we extend the theory of accident triangle to claim that the importance of data samples should be based on characteristics such as injury severity, accident frequency, and accident type. Thus, three oversampling methods are proposed based on assigning different weights to samples in the minority class. We find robust improvements among different machine learning algorithms. For the lack of open-source safety datasets, we are sharing three imbalanced datasets, e.g., a 9-year nationwide construction accident record dataset, and their corresponding codes.

Graph neural networks (GNNs) take as input the graph structure and the feature vectors associated with the nodes. Both contain noisy information about the labels. Here we propose joint denoising and rewiring (JDR)--an algorithm to jointly denoise the graph structure and features, which can improve the performance of any downstream algorithm. We do this by defining and maximizing the alignment between the leading eigenspaces of graph and feature matrices. To approximately solve this computationally hard problem, we propose a heuristic that efficiently handles real-world graph datasets with many classes and different levels of homophily or heterophily. We experimentally verify the effectiveness of our approach on synthetic data and real-world graph datasets. The results show that JDR consistently outperforms existing rewiring methods on node classification tasks using GNNs as downstream models.

Estimating causal effects from observational data is challenging, especially in the presence of latent confounders. Much work has been done on addressing this challenge, but most of the existing research ignores the bias introduced by the post-treatment variables. In this paper, we propose a novel method of joint Variational AutoEncoder (VAE) and identifiable Variational AutoEncoder (iVAE) for learning the representations of latent confounders and latent post-treatment variables from their proxy variables, termed CPTiVAE, to achieve unbiased causal effect estimation from observational data. We further prove the identifiability in terms of the representation of latent post-treatment variables. Extensive experiments on synthetic and semi-synthetic datasets demonstrate that the CPTiVAE outperforms the state-of-the-art methods in the presence of latent confounders and post-treatment variables. We further apply CPTiVAE to a real-world dataset to show its potential application.

We consider minimizing finite-sum and expectation objective functions via Hessian-averaging based subsampled Newton methods. These methods allow for gradient inexactness and have fixed per-iteration Hessian approximation costs. The recent work (Na et al. 2023) demonstrated that Hessian averaging can be utilized to achieve fast $\mathcal{O}\left(\sqrt{\tfrac{\log k}{k}}\right)$ local superlinear convergence for strongly convex functions in high probability, while maintaining fixed per-iteration Hessian costs. These methods, however, require gradient exactness and strong convexity, which poses challenges for their practical implementation. To address this concern we consider Hessian-averaged methods that allow gradient inexactness via norm condition based adaptive-sampling strategies. For the finite-sum problem we utilize deterministic sampling techniques which lead to global linear and sublinear convergence rates for strongly convex and nonconvex functions respectively. In this setting we are able to derive an improved deterministic local superlinear convergence rate of $\mathcal{O}\left(\tfrac{1}{k}\right)$. For the %expected risk expectation problem we utilize stochastic sampling techniques, and derive global linear and sublinear rates for strongly convex and nonconvex functions, as well as a $\mathcal{O}\left(\tfrac{1}{\sqrt{k}}\right)$ local superlinear convergence rate, all in expectation. We present novel analysis techniques that differ from the previous probabilistic results. Additionally, we propose scalable and efficient variations of these methods via diagonal approximations and derive the novel diagonally-averaged Newton (Dan) method for large-scale problems. Our numerical results demonstrate that the Hessian averaging not only helps with convergence, but can lead to state-of-the-art performance on difficult problems such as CIFAR100 classification with ResNets.

Understanding distinct neurological aging patterns across various populations is vital in the context of a globally aging populace. This study seeks to unravel the structural variations in the aging brain, taking into consideration different ethnic backgrounds. MRI data from Indian, Chinese, Japanese, and Caucasian populations were analyzed using a two-pronged approach. Initially, a group analysis was performed involving tissue segmentation through FSL-FAST, examining gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF). Subsequently, a continuous model-based analysis was employed, defining aging as a diffeomorphic transformation, which facilitated a detailed intra- and inter-population analysis, and examined both global anatomy and age-dependent distances of each population in comparison to the Indian population. Detailed insights into the spatial distribution of brain deformations over time were obtained, with particular focus on anatomical changes relative to a reference time point. The proposed comprehensive structural comparison framework, applied to a sample dataset encompassing four distinct populations, represents a pioneering effort to compare structural differences related to brain aging on a global scale. Subsequent studies utilizing larger datasets and applying the proposed analysis across groups based on various criteria will further advance our understanding of aging-related changes.

Social contact studies, investigating social contact patterns in a population sample, have been an important contribution for epidemic models to better fit real life epidemics. A contact matrix $M$, having the \emph{mean} number of contacts between individuals of different age groups as its elements, is estimated and used in combination with a multitype epidemic model to produce better data fitting and also giving more appropriate expressions for $R_0$ and other model outcomes. However, $M$ does not capture \emph{variation} in contacts \emph{within} each age group, which is often large in empirical settings. Here such variation within age groups is included in a simple way by dividing each age group into two halves: the socially active and the socially less active. The extended contact matrix, and its associated epidemic model, empirically show that acknowledging variation in social activity within age groups has a substantial impact on modelling outcomes such as $R_0$ and the final fraction $\tau$ getting infected. In fact, the variation in social activity within age groups is often more important for data fitting than the division into different age groups itself. However, a difficulty with heterogeneity in social activity is that social contact studies typically lack information on if mixing with respect to social activity is assortative or not, i.e.\ do socially active tend to mix more with other socially active or more with socially less active? The analyses show that accounting for heterogeneity in social activity improves the analyses irrespective of if such mixing is assortative or not, but the different assumptions gives rather different output. Future social contact studies should hence also try to infer the degree of assortativity of contacts with respect to social activity.

We give a comprehensive description of Wasserstein gradient flows of maximum mean discrepancy (MMD) functionals $\mathcal F_\nu := \text{MMD}_K^2(\cdot, \nu)$ towards given target measures $\nu$ on the real line, where we focus on the negative distance kernel $K(x,y) := -|x-y|$. In one dimension, the Wasserstein-2 space can be isometrically embedded into the cone $\mathcal C(0,1) \subset L_2(0,1)$ of quantile functions leading to a characterization of Wasserstein gradient flows via the solution of an associated Cauchy problem on $L_2(0,1)$. Based on the construction of an appropriate counterpart of $\mathcal F_\nu$ on $L_2(0,1)$ and its subdifferential, we provide a solution of the Cauchy problem. For discrete target measures $\nu$, this results in a piecewise linear solution formula. We prove invariance and smoothing properties of the flow on subsets of $\mathcal C(0,1)$. For certain $\mathcal F_\nu$-flows this implies that initial point measures instantly become absolutely continuous, and stay so over time. Finally, we illustrate the behavior of the flow by various numerical examples using an implicit Euler scheme and demonstrate differences to the explicit Euler scheme, which is easier to compute, but comes with limited convergence guarantees.

We present a novel approach for test-time adaptation via online self-training, consisting of two components. First, we introduce a statistical framework that detects distribution shifts in the classifier's entropy values obtained on a stream of unlabeled samples. Second, we devise an online adaptation mechanism that utilizes the evidence of distribution shifts captured by the detection tool to dynamically update the classifier's parameters. The resulting adaptation process drives the distribution of test entropy values obtained from the self-trained classifier to match those of the source domain, building invariance to distribution shifts. This approach departs from the conventional self-training method, which focuses on minimizing the classifier's entropy. Our approach combines concepts in betting martingales and online learning to form a detection tool capable of quickly reacting to distribution shifts. We then reveal a tight relation between our adaptation scheme and optimal transport, which forms the basis of our novel self-supervised loss. Experimental results demonstrate that our approach improves test-time accuracy under distribution shifts while maintaining accuracy and calibration in their absence, outperforming leading entropy minimization methods across various scenarios.

Traditional models reliant solely on pairwise associations often prove insufficient in capturing the complex statistical structure inherent in multivariate data. Yet existing methods for identifying information shared among groups of $d>3$ variables are often intractable; asymmetric around a target variable; or unable to consider all factorisations of the joint probability distribution. Here, we present a framework that systematically derives high-order measures using lattice and operator function pairs, whereby the lattice captures the algebraic relational structure of the variables and the operator function computes measures over the lattice. We show that many existing information-theoretic high-order measures can be derived by using divergences as operator functions on sublattices of the partition lattice, thus preventing the accurate quantification of all interactions for $d>3$. Similarly, we show that using the KL divergence as the operator function also leads to unwanted cancellation of interactions for $d>3$. To characterise all interactions among $d$ variables, we introduce the Streitberg information defined on the full partition lattice using generalisations of the KL divergence as operator functions. We validate our results numerically on synthetic data, and illustrate the use of the Streitberg information through applications to stock market returns and neural electrophysiology data.

This paper introduces a novel anomaly detection framework that combines the robust statistical principles of density-estimation-based anomaly detection methods with the representation-learning capabilities of deep learning models. The method originated from this framework is presented in two different versions: a shallow approach employing a density-estimation model based on adaptive Fourier features and density matrices, and a deep approach that integrates an autoencoder to learn a low-dimensional representation of the data. By estimating the density of new samples, both methods are able to find normality scores. The methods can be seamlessly integrated into an end-to-end architecture and optimized using gradient-based optimization techniques. To evaluate their performance, extensive experiments were conducted on various benchmark datasets. The results demonstrate that both versions of the method can achieve comparable or superior performance when compared to other state-of-the-art methods. Notably, the shallow approach performs better on datasets with fewer dimensions, while the autoencoder-based approach shows improved performance on datasets with higher dimensions.

Recent years have seen a resurgence in interest in marketing mix models (MMMs), which are aggregate-level models of marketing effectiveness. Often these models incorporate nonlinear effects, and either implicitly or explicitly assume that marketing effectiveness varies over time. In this paper, we show that nonlinear and time-varying effects are often not identifiable from standard marketing mix data: while certain data patterns may be suggestive of nonlinear effects, such patterns may also emerge under simpler models that incorporate dynamics in marketing effectiveness. This lack of identification is problematic because nonlinearities and dynamics suggest fundamentally different optimal marketing allocations. We examine this identification issue through theory and simulations, wherein we explore the exact conditions under which conflation between the two types of models is likely to occur. In doing so, we introduce a flexible Bayesian nonparametric model that allows us to both flexibly simulate and estimate different data-generating processes. We show that conflating the two types of effects is especially likely in the presence of autocorrelated marketing variables, which are common in practice, especially given the widespread use of stock variables to capture long-run effects of advertising. We illustrate these ideas through numerous empirical applications to real-world marketing mix data, showing the prevalence of the conflation issue in practice. Finally, we show how marketers can avoid this conflation, by designing experiments that strategically manipulate spending in ways that pin down model form.

We demonstrate that adaptively controlling the size of individual regression trees in a random forest can improve predictive performance, contrary to the conventional wisdom that trees should be fully grown. A fast pruning algorithm, alpha-trimming, is proposed as an effective approach to pruning trees within a random forest, where more aggressive pruning is performed in regions with a low signal-to-noise ratio. The amount of overall pruning is controlled by adjusting the weight on an information criterion penalty as a tuning parameter, with the standard random forest being a special case of our alpha-trimmed random forest. A remarkable feature of alpha-trimming is that its tuning parameter can be adjusted without refitting the trees in the random forest once the trees have been fully grown once. In a benchmark suite of 46 example data sets, mean squared prediction error is often substantially lowered by using our pruning algorithm and is never substantially increased compared to a random forest with fully-grown trees at default parameter settings.

While randomized trials may be the gold standard for evaluating the effectiveness of the treatment intervention, in some special circumstances, single-arm clinical trials utilizing external control may be considered. The causal treatment effect of interest for single-arm studies is usually the average treatment effect on the treated (ATT) rather than the average treatment effect (ATE). Although methods have been developed to estimate the ATT, the selection and use of these methods require a thorough comparison and in-depth understanding of the advantages and disadvantages of these methods. In this study, we conducted simulations under different identifiability assumptions to compare the performance metrics (e.g., bias, standard deviation (SD), mean squared error (MSE), type I error rate) for a variety of methods, including the regression model, propensity score matching, Mahalanobis distance matching, coarsened exact matching, inverse probability weighting, augmented inverse probability weighting (AIPW), AIPW with SuperLearner, and targeted maximum likelihood estimator (TMLE) with SuperLearner. Our simulation results demonstrate that the doubly robust methods in general have smaller biases than other methods. In terms of SD, nonmatching methods in general have smaller SDs than matching-based methods. The performance of MSE is a trade-off between the bias and SD, and no method consistently performs better in term of MSE. The identifiability assumptions are critical to the models' performance: violation of the positivity assumption can lead to a significant inflation of type I errors in some methods; violation of the unconfoundedness assumption can lead to a large bias for all methods... (Further details are available in the main body of the paper).

This paper introduces a local linear smoother for regression surfaces on the simplex. The estimator solves a least-squares regression problem weighted by a locally adaptive Dirichlet kernel, ensuring excellent boundary properties. Asymptotic results for the bias, variance, mean squared error, and mean integrated squared error are derived, generalizing the univariate results of Chen (2002). A simulation study shows that the proposed local linear estimator with Dirichlet kernel outperforms its only direct competitor in the literature, the Nadaraya-Watson estimator with Dirichlet kernel due to Bouzebda, Nezzal, and Elhattab (2024).

We introduce a generic estimator for the false discovery rate of any model selection procedure, in common statistical modeling settings including the Gaussian linear model, Gaussian graphical model, and model-X setting. We prove that our method has a conservative (non-negative) bias in finite samples under standard statistical assumptions, and provide a bootstrap method for assessing its standard error. For methods like the Lasso, forward-stepwise regression, and the graphical Lasso, our estimator serves as a valuable companion to cross-validation, illuminating the tradeoff between prediction error and variable selection accuracy as a function of the model complexity parameter.

If the conclusion of a data analysis is sensitive to dropping very few data points, that conclusion might hinge on the particular data at hand rather than representing a more broadly applicable truth. How could we check whether this sensitivity holds? One idea is to consider every small subset of data, drop it from the dataset, and re-run our analysis. But running MCMC to approximate a Bayesian posterior is already very expensive; running multiple times is prohibitive, and the number of re-runs needed here is combinatorially large. Recent work proposes a fast and accurate approximation to find the worst-case dropped data subset, but that work was developed for problems based on estimating equations -- and does not directly handle Bayesian posterior approximations using MCMC. We make two principal contributions in the present work. We adapt the existing data-dropping approximation to estimators computed via MCMC. Observing that Monte Carlo errors induce variability in the approximation, we use a variant of the bootstrap to quantify this uncertainty. We demonstrate how to use our approximation in practice to determine whether there is non-robustness in a problem. Empirically, our method is accurate in simple models, such as linear regression. In models with complicated structure, such as hierarchical models, the performance of our method is mixed.

We study the problem of learning multi-index models in high-dimensions using a two-layer neural network trained with the mean-field Langevin algorithm. Under mild distributional assumptions on the data, we characterize the effective dimension $d_{\mathrm{eff}}$ that controls both sample and computational complexity by utilizing the adaptivity of neural networks to latent low-dimensional structures. When the data exhibit such a structure, $d_{\mathrm{eff}}$ can be significantly smaller than the ambient dimension. We prove that the sample complexity grows almost linearly with $d_{\mathrm{eff}}$, bypassing the limitations of the information and generative exponents that appeared in recent analyses of gradient-based feature learning. On the other hand, the computational complexity may inevitably grow exponentially with $d_{\mathrm{eff}}$ in the worst-case scenario. Motivated by improving computational complexity, we take the first steps towards polynomial time convergence of the mean-field Langevin algorithm by investigating a setting where the weights are constrained to be on a compact manifold with positive Ricci curvature, such as the hypersphere. There, we study assumptions under which polynomial time convergence is achievable, whereas similar assumptions in the Euclidean setting lead to exponential time complexity.

Linear mixed effects models are widely used in statistical modelling. We consider a mixed effects model with Bayesian variable selection in the random effects using spike-and-slab priors and developed a variational Bayes inference scheme that can be applied to large data sets. An EM algorithm is proposed for the model with normal errors where the posterior distribution of the variable inclusion parameters is approximated using an Occam's window approach. Placing this approach within a variational Bayes scheme also the algorithm to be extended to the model with skew-t errors. The performance of the algorithm is evaluated in a simulation study and applied to a longitudinal model for elite athlete performance in the 100 metre sprint and weightlifting.

The problem of finding the expected value of a statistic of a locally stable point process in a bounded region is addressed. We propose an adaptive importance sampling for solving the problem. In our proposal, we restrict the importance point process to the family of homogeneous Poisson point processes, which enables us to generate quickly independent samples of the importance point process. The optimal intensity of the importance point process is found by applying the cross-entropy minimization method. In the proposed scheme, the expected value of the function and the optimal intensity are iteratively estimated in an adaptive manner. We show that the proposed estimator converges to the target value almost surely, and prove the asymptotic normality of it. We explain how to apply the proposed scheme to the estimation of the intensity of a stationary pairwise interaction point process. The performance of the proposed scheme is compared numerically with the Markov chain Monte Carlo simulation and the perfect sampling.

In this paper, we present a comprehensive analysis of the posterior covariance field in Gaussian processes, with applications to the posterior covariance matrix. The analysis is based on the Gaussian prior covariance but the approach also applies to other covariance kernels. Our geometric analysis reveals how the Gaussian kernel's bandwidth parameter and the spatial distribution of the observations influence the posterior covariance as well as the corresponding covariance matrix, enabling straightforward identification of areas with high or low covariance in magnitude. Drawing inspiration from the a posteriori error estimation techniques in adaptive finite element methods, we also propose several estimators to efficiently measure the absolute posterior covariance field, which can be used for efficient covariance matrix approximation and preconditioning. We conduct a wide range of experiments to illustrate our theoretical findings and their practical applications.

In this paper we consider the modeling of measurement error for fund returns data. In particular, given access to a time-series of discretely observed log-returns and the associated maximum over the observation period, we develop a stochastic model which models the true log-returns and maximum via a L\'evy process and the data as a measurement error there-of. The main technical difficulty of trying to infer this model, for instance Bayesian parameter estimation, is that the joint transition density of the return and maximum is seldom known, nor can it be simulated exactly. Based upon the novel stick breaking representation of [12] we provide an approximation of the model. We develop a Markov chain Monte Carlo (MCMC) algorithm to sample from the Bayesian posterior of the approximated posterior and then extend this to a multilevel MCMC method which can reduce the computational cost to approximate posterior expectations, relative to ordinary MCMC. We implement our methodology on several applications including for real data.