Date: Fri, 9 Aug 2024 | Total: 23

Difference-in-differences (DiD) is the most popular observational causal inference method in health policy, employed to evaluate the real-world impact of policies and programs. To estimate treatment effects, DiD relies on the "parallel trends assumption", that on average treatment and comparison groups would have had parallel trajectories in the absence of an intervention. Historically, DiD has been considered broadly applicable and straightforward to implement, but recent years have seen rapid advancements in DiD methods. This paper reviews and synthesizes these innovations for medical and health policy researchers. We focus on four topics: (1) assessing the parallel trends assumption in health policy contexts; (2) relaxing the parallel trends assumption when appropriate; (3) employing estimators to account for staggered treatment timing; and (4) conducting robust inference for analyses in which normal-based clustered standard errors are inappropriate. For each, we explain challenges and common pitfalls in traditional DiD and modern methods available to address these issues.
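
The canonical two-group, two-period form of the estimator described above can be sketched as follows. This is only an illustration of the basic contrast under the parallel trends assumption, not one of the modern estimators the paper reviews; the function name `did_estimate` and the toy data are hypothetical.

```python
from statistics import mean


def did_estimate(rows):
    """Two-group, two-period DiD from (treated, post, outcome) rows.

    Under parallel trends, the change in the comparison group stands in
    for the change the treated group would have seen without treatment.
    """
    cells = {}
    for treated, post, outcome in rows:
        cells.setdefault((treated, post), []).append(outcome)
    cell_mean = {key: mean(values) for key, values in cells.items()}

    treated_change = cell_mean[(True, True)] - cell_mean[(True, False)]
    comparison_change = cell_mean[(False, True)] - cell_mean[(False, False)]
    return treated_change - comparison_change


# Hypothetical toy data: (treated, post, outcome)
rows = [
    (True, False, 10.0), (True, False, 12.0),
    (True, True, 15.0), (True, True, 17.0),
    (False, False, 8.0), (False, False, 10.0),
    (False, True, 9.0), (False, True, 11.0),
]
effect = did_estimate(rows)
print(effect)
```

Here the treated group changes by 5 and the comparison group by 1, so the estimate is 4. With staggered treatment timing or few clusters, this simple contrast is exactly what the methods surveyed in the paper are meant to replace.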

Recent years have seen substantial advances in our understanding of high-dimensional ridge regression, but existing theories assume that training examples are independent. By leveraging recent techniques from random matrix theory and free probability, we provide sharp asymptotics for the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations. We demonstrate that in this setting, the generalized cross validation estimator (GCV) fails to correctly predict the out-of-sample risk. However, in the case where the noise residuals have the same correlations as the data points, one can modify the GCV to yield an efficiently-computable unbiased estimator that concentrates in the high-dimensional limit, which we dub CorrGCV. We further extend our asymptotic analysis to the case where the test point has nontrivial correlations with the training set, a setting often encountered in time series forecasting. Assuming knowledge of the correlation structure of the time series, this again yields an extension of the GCV estimator, and sharply characterizes the degree to which such test points yield an overly optimistic prediction of long-time risk. We validate the predictions of our theory across a variety of high dimensional data.
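
For reference, the classical GCV estimator that the abstract takes as its starting point can be written as below. The notation ($X$, $y$, $\lambda$, $S_\lambda$) and the $n\lambda$ scaling of the penalty are assumptions chosen here for illustration; the paper's CorrGCV correction is not reproduced.

```latex
% Assumed notation: X is the n x p design matrix, y the response vector,
% and \lambda > 0 the ridge penalty (the n\lambda scaling is one common convention).
\hat{\beta}_\lambda = (X^\top X + n\lambda I_p)^{-1} X^\top y,
\qquad
S_\lambda = X (X^\top X + n\lambda I_p)^{-1} X^\top,
\qquad
\mathrm{GCV}(\lambda)
  = \frac{\tfrac{1}{n}\,\lVert (I_n - S_\lambda)\, y \rVert_2^2}
         {\bigl(1 - \tfrac{1}{n}\operatorname{tr} S_\lambda\bigr)^2}.
```

The numerator is the in-sample (training) error and the denominator inflates it by the effective degrees of freedom $\operatorname{tr} S_\lambda$, which is the step that relies on independent training examples.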

In this paper, we discuss the asymptotic behavior of the Upper Confidence Bound (UCB) algorithm in the context of multi-armed bandit problems and discuss its implications for downstream inferential tasks. While inferential tasks become challenging when data is collected in a sequential manner, we argue that this problem can be alleviated when the sequential algorithm at hand satisfies a certain stability property. This notion of stability is motivated by the seminal work of Lai and Wei (1982). Our first main result shows that such a stability property is always satisfied for the UCB algorithm, and as a result the sample means for each arm are asymptotically normal. Next, we examine the stability properties of the UCB algorithm when the number of arms $K$ is allowed to grow with the number of arm pulls $T$. We show that in such a case the arms are stable when $\frac{\log K}{\log T} \rightarrow 0$, and the number of near-optimal arms is large.

Hybrid Reinforcement Learning (RL), where an agent learns from both an offline dataset and online explorations in an unknown environment, has garnered significant recent interest. A crucial question posed by Xie et al. (2022) is whether hybrid RL can improve upon the existing lower bounds established in purely offline and purely online RL without relying on the single-policy concentrability assumption. While Li et al. (2023) provided an affirmative answer to this question in the tabular PAC RL case, the question remains unsettled for both the regret-minimizing RL case and the non-tabular case. In this work, building upon recent advancements in offline RL and reward-agnostic exploration, we develop computationally efficient algorithms for both PAC and regret-minimizing RL with linear function approximation, without single-policy concentrability. We demonstrate that these algorithms achieve sharper error or regret bounds that are no worse than, and can improve on, the optimal sample complexity in offline RL (the first algorithm, for PAC RL) and online RL (the second algorithm, for regret-minimizing RL) in linear Markov decision processes (MDPs), regardless of the quality of the behavior policy. To our knowledge, this work establishes the tightest theoretical guarantees currently available for hybrid RL in linear MDPs.

Estimating the total treatment effect (TTE) of a new feature in social platforms is crucial for understanding its impact on user behavior. However, the presence of network interference, which arises from user interactions, often complicates this estimation process. Experimenters typically face challenges in fully capturing the intricate structure of this interference, leading to less reliable estimates. To address this issue, we propose a novel approach that leverages surrogate networks and the pseudo inverse estimator. Our contributions can be summarized as follows: (1) We introduce the surrogate network framework, which simulates the practical situation where experimenters build an approximation of the true interference network using observable data. (2) We investigate the performance of the pseudo inverse estimator within this framework, revealing a bias-variance trade-off introduced by the surrogate network. We demonstrate a tighter asymptotic variance bound compared to previous studies and propose an enhanced variance estimator outperforming the original estimator. (3) We apply the pseudo inverse estimator to a real experiment involving over 50 million users, demonstrating its effectiveness in detecting network interference when combined with the difference-in-means estimator. Our research aims to bridge the gap between theoretical literature and practical implementation, providing a solution for estimating TTE in the presence of network interference and unknown interference structures.

We present the design and scalable implementation of an exascale climate emulator for addressing the escalating computational and storage requirements of high-resolution Earth System Model simulations. We utilize the spherical harmonic transform to stochastically model spatio-temporal variations in climate data. This provides tunable spatio-temporal resolution and significantly improves the fidelity and granularity of climate emulation, achieving an ultra-high spatial resolution of $0.034^\circ$ (approximately 3.5 km). Our emulator, trained on 318 billion hourly temperature data points from a 35-year and 31 billion daily data points from an 83-year global simulation ensemble, generates statistically consistent climate emulations. We extend linear solver software to mixed-precision arithmetic GPUs, applying different precisions within a single solver to adapt to different correlation strengths. The PaRSEC runtime system supports efficient parallel matrix operations by optimizing the dynamic balance between computation, communication, and memory requirements. Our BLAS3-rich code is optimized for systems equipped with four different families and generations of GPUs, scaling well to achieve 0.976 EFlop/s on 9,025 nodes (36,100 AMD MI250X multichip module (MCM) GPUs) of Frontier (nearly full system), 0.739 EFlop/s on 1,936 nodes (7,744 Grace-Hopper Superchips (GH200)) of Alps, 0.243 EFlop/s on 1,024 nodes (4,096 A100 GPUs) of Leonardo, and 0.375 EFlop/s on 3,072 nodes (18,432 V100 GPUs) of Summit.

Symbolic data analysis (SDA) aggregates large individual-level datasets into a small number of distributional summaries, such as random rectangles or random histograms. Inference is carried out using these summaries in place of the original dataset, resulting in computational gains at the loss of some information. In likelihood-based SDA, the likelihood function is characterised by an integral with a large exponent, which limits the method's utility because, for typical models, the integral is unavailable in closed form. In addition, the likelihood function is known to produce biased parameter estimates in some circumstances. Our article develops a Bayesian framework for SDA methods in these settings that resolves the issues resulting from integral intractability and biased parameter estimation using pseudo-marginal Markov chain Monte Carlo methods. We develop an exact but computationally expensive method based on path sampling and the block-Poisson estimator, and a much faster, but approximate, method based on Taylor expansion. Through simulation and real-data examples we demonstrate the performance of the developed methods, showing large reductions in computation time compared to the full-data analysis, with only a small loss of information.

In this paper the accuracy and robustness of quality measures for the assessment of machine learning models are investigated. The prediction quality of a machine learning model is evaluated in a model-independent way based on a cross-validation approach, where the approximation error is estimated for unknown data. The presented measures quantify the amount of explained variation in the model prediction. The reliability of these measures is assessed by means of several numerical examples, where an additional data set for the verification of the estimated prediction error is available. Furthermore, the confidence bounds of the presented quality measures are estimated and local quality measures are derived from the prediction residuals obtained by the cross-validation approach.
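
One common way to turn cross-validation residuals into an explained-variation measure is sketched below. It is an assumption-laden illustration, not the specific measures defined in the paper: the model is a plain least-squares line, the folds are interleaved, and the names `fit_line` and `cv_explained_variation` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cv_explained_variation(xs, ys, k=5):
    """Return (1 - SS_res / SS_tot, residuals) using k-fold CV residuals.

    Each residual comes from a model that never saw that point, so the
    score estimates explained variation on unknown data.
    """
    n = len(xs)
    residuals = [0.0] * n
    for fold in range(k):
        held_out = set(range(fold, n, k))  # interleaved folds (an assumption)
        train = [i for i in range(n) if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        for i in held_out:
            residuals[i] = ys[i] - (slope * xs[i] + intercept)
    mean_y = sum(ys) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot, residuals


# Hypothetical data: an exact line, and the same line with alternating noise.
xs = [float(i) for i in range(20)]
exact_ys = [2.0 * x + 1.0 for x in xs]
noisy_ys = [y + (0.5 if i % 2 == 0 else -0.5) for i, y in enumerate(exact_ys)]

exact_score, _ = cv_explained_variation(xs, exact_ys)
noisy_score, noisy_residuals = cv_explained_variation(xs, noisy_ys)
print(exact_score, noisy_score)
```

The per-point entries of `noisy_residuals` are the kind of quantity from which local quality measures could be derived, since each one is an out-of-sample error at a known location.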

Uncovering genuine relationships between a response variable of interest and a large collection of covariates is a fundamental and practically important problem. In the context of Gaussian linear models, both the Bayesian and non-Bayesian literatures are well developed and there are no substantial differences in the model selection consistency results available from the two schools. For the more challenging generalized linear models (GLMs), however, Bayesian model selection consistency results are lacking in several ways. In this paper, we construct a Bayesian posterior distribution using an appropriate data-dependent prior and develop its asymptotic concentration properties using new theoretical techniques. In particular, we leverage Spokoiny's powerful non-asymptotic theory to obtain sharp quadratic approximations of the GLM's log-likelihood function, which leads to tight bounds on the errors associated with the model-specific maximum likelihood estimators and the Laplace approximation of our Bayesian marginal likelihood. In turn, these improved bounds lead to significantly stronger, near-optimal Bayesian model selection consistency results, e.g., far weaker beta-min conditions, compared to those available in the existing literature. In particular, our results are applicable to the Poisson regression model, in which the score function is not sub-Gaussian.

There is currently a focus on statistical methods which can use external trial information to help accelerate the discovery, development and delivery of medicine. Bayesian methods facilitate borrowing which is "dynamic" in the sense that the similarity of the data helps to determine how much information is used. We propose a Bayesian semiparametric model, which allows the baseline hazard to take any form through an ensemble average. We introduce priors to smooth the posterior baseline hazard, improving both model estimation and borrowing characteristics. A "lump-and-smear" borrowing prior accounts for non-exchangeable historical data and helps reduce the maximum type I error in the presence of prior-data conflict. In this article, we present BayesFBHborrow, an R package, which enables the user to perform Bayesian borrowing with a historical control dataset in a semiparametric time-to-event model. User-defined hyperparameters smooth an ensemble averaged posterior baseline hazard. The model offers the specification of lump-and-smear priors on the commensurability parameter where the associated hyperparameters can be chosen according to the user's tolerance for difference between the log baseline hazards. We demonstrate the performance of our Bayesian flexible baseline hazard model on a simulated and a real-world dataset.

Previous studies yielded discouraging results for item-level locally differentially private linear regression under an $s^*$-sparsity assumption, where the minimax rate for $nm$ samples is $\mathcal{O}(s^{*}d / nm\varepsilon^2)$. This can be challenging for high-dimensional data, where the dimension $d$ is extremely large. In this work, we investigate user-level locally differentially private sparse linear regression. We show that with $n$ users each contributing $m$ samples, the linear dependence on the dimension $d$ can be eliminated, yielding an error upper bound of $\mathcal{O}(s^{*2} / nm\varepsilon^2)$. We propose a framework that first selects candidate variables and then conducts estimation in the narrowed low-dimensional space, which is extendable to general sparse estimation problems with tight error bounds. Experiments on both synthetic and real datasets demonstrate the superiority of the proposed methods. Both the theoretical and empirical results suggest that, with the same number of samples, locally private sparse estimation is better conducted when multiple samples per user are available.

Network data have attracted considerable attention in modern statistics. In research on complex network data, one key issue is finding the underlying connection structure given a network sample. The methods that have been proposed in the literature usually assume that the underlying structure is a known model. In practice, however, the true model is usually unknown, and network learning procedures based on these methods may suffer from model misspecification. To handle this issue, based on random matrix theory, we first give a spectral property of the normalized adjacency matrix under a mild condition. Further, we establish a general goodness-of-fit test procedure for unweighted and undirected networks. We prove that, under the null, the proposed statistic converges in distribution to the standard normal distribution. Theoretically, this testing procedure is suitable for nearly all popular network models, such as stochastic block models and latent space models. Further, we apply the proposed method to the degree-corrected mixed membership model and give a sequential estimator of the number of communities. Both simulation studies and real-world data examples indicate that the proposed method works well.

Estimating the maximum mean finds a variety of applications in practice. In this paper, we study estimation of the maximum mean using an upper confidence bound (UCB) approach where the sampling budget is adaptively allocated to one of the systems. We study in depth the existing grand average (GA) estimator, and propose a new largest-size average (LSA) estimator. Specifically, we establish statistical guarantees, including strong consistency, asymptotic mean squared errors, and central limit theorems (CLTs) for both estimators, which are new to the literature. We show that LSA is preferable over GA, as the bias of the former decays at a rate much faster than that of the latter when sample size increases. By using the CLTs, we further construct asymptotically valid confidence intervals for the maximum mean, and propose a single hypothesis test for a multiple comparison problem with application to clinical trials. Statistical efficiency of the resulting point and interval estimates and the proposed single hypothesis test is demonstrated via numerical examples.
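
The sampling scheme and the two estimators can be sketched as follows. This rests on assumptions: it uses a textbook UCB1 index with Gaussian noise, it reads GA as the average of all collected samples and LSA as the sample mean of the system that received the most samples, and the function name `ucb_max_mean` and all parameter values are hypothetical. The paper's exact index and definitions may differ.

```python
import math
import random


def ucb_max_mean(means, budget, sigma=1.0, seed=0):
    """Allocate `budget` samples by UCB1 and return (GA, LSA, counts)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, budget + 1):
        if t <= k:
            arm = t - 1  # sample every system once to initialise
        else:
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        sums[arm] += rng.gauss(means[arm], sigma)
        counts[arm] += 1

    # Grand average: pool every sample, including those from inferior systems.
    ga = sum(sums) / budget
    # Largest-size average: use only the most frequently sampled system.
    largest = max(range(k), key=lambda i: counts[i])
    lsa = sums[largest] / counts[largest]
    return ga, lsa, counts


# Hypothetical systems whose true maximum mean is 1.0.
ga, lsa, counts = ucb_max_mean([0.0, 0.5, 1.0], budget=5000)
print(ga, lsa, counts)
```

Because GA pools samples drawn from inferior systems, it is pulled below the maximum mean, whereas LSA discards them; this matches the abstract's claim that the bias of LSA decays faster.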

A central pillar of the UK's response to the SARS-CoV-2 pandemic was the provision of up-to-the-moment nowcasts and short-term projections to monitor current trends in transmission and associated healthcare burden. Here we present a detailed deconstruction of one of the 'real-time' models that was a key contributor to this response, focussing on the model adaptations required over three pandemic years characterised by the imposition of lockdowns, mass vaccination campaigns and the emergence of new pandemic strains. The Bayesian model integrates an array of surveillance and other data sources including a novel approach to incorporating prevalence estimates from an unprecedented large-scale household survey. We present a full range of estimates of the epidemic history and the changing severity of the infection, quantify the impact of the vaccination programme and deconstruct contributing factors to the reproduction number. We further investigate the sensitivity of model-derived insights to the availability and timeliness of prevalence data, identifying its importance to the production of robust estimates.

In this paper we propose a procedure for robust estimation in the context of generalized linear models based on the maximum Lq-likelihood method. Alongside this, an estimation algorithm that represents a natural extension of the usual iteratively weighted least squares method in generalized linear models is presented. It is through the discussion of the asymptotic distribution of the proposed estimator and a set of statistics for testing linear hypotheses that it is possible to define standardized residuals using the mean-shift outlier model. In addition, robust versions of the deviance function and the Akaike information criterion are defined with the aim of providing tools for model selection. Finally, the performance of the proposed methodology is illustrated through a simulation study and analysis of a real dataset.
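
As background, the maximum Lq-likelihood idea is usually stated as below. The symbols ($f$, $\theta$, $q$, $y_i$) are assumed notation for a generic parametric density, not the paper's GLM-specific formulation.

```latex
% Assumed notation: f(y; \theta) is the model density, q > 0 the distortion parameter.
L_q(u) =
\begin{cases}
  \dfrac{u^{1-q} - 1}{1 - q}, & q \neq 1, \\[1ex]
  \log u, & q = 1,
\end{cases}
\qquad
\hat{\theta}_q = \arg\max_{\theta} \sum_{i=1}^{n} L_q\bigl(f(y_i; \theta)\bigr),
\qquad
\sum_{i=1}^{n} f(y_i; \theta)^{1-q}\, \frac{\partial}{\partial\theta} \log f(y_i; \theta) = 0 .
```

The estimating equation on the right is the ordinary score equation with each observation weighted by $f(y_i;\theta)^{1-q}$. For $q < 1$, observations that are unlikely under the model are downweighted, which is the source of robustness and suggests why a reweighted least squares scheme is a natural fit.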

The combined influence of urban heat islands, climate change, and extreme temperature events is increasingly impacting transit travelers, especially vulnerable populations such as older adults, people with disabilities, and those with chronic diseases. Previous studies have generally attempted to address this issue at either the micro- or macro-level, but each approach presents different limitations in modeling the impacts on transit trips. Other research proposes a meso-level approach to address some of these gaps, but the use of additive exposure calculation and spatial shortest path routing constrains meso-level modeling accuracy. This study introduces HeatPath Analyzer, a framework to assess the exposure of transit riders to extreme temperatures, using TransitSim 4.0 to generate second-by-second spatio-temporal trip trajectories, traveler activity profiles, and thermal comfort levels along the entire journey. The approach combines the heat stress standards proposed by the NWS and CDC to estimate cumulative exposure for transit riders, with specific parameters tailored to the elderly and people with disabilities. The framework assesses the influence of extreme heat and winter chill. A case study in Atlanta, GA, reveals that 10.2% of trips on an average summer weekday in 2019 were at risk of extreme heat. The results uncover exposure disparities across different transit trip mode segments, and across mitigation-based and adaptation-based strategies. While the mitigation-based strategy highlights high-exposure segments such as long ingress and egress, adaptation should be prioritized toward the middle or second half of the trip when a traveler is waiting for transit or transferring between routes. A comparison between the traditional additive approach and the dynamic approach presented also shows significant disparities, which, if overlooked, can mislead policy decisions.

This paper introduces a new method for change detection in psychometric studies based on the recently introduced pseudo Score statistic, for which the sampling distribution under the alternative hypothesis has been determined. Our approach has the advantage of simplicity in its computation, eliminating the need for resampling or simulations to obtain critical values. Additionally, it comes with a known null/alternative distribution, facilitating easy calculations for power levels and sample size planning. The paper also discusses power analysis in segmented regression, namely the estimation of sample size or power level when the study data being collected focuses on a covariate expected to affect the mean response via a piecewise relationship with an unknown breakpoint. We present simulation results showing that our method outperforms other Tests for a Change Point (TFCP) with both normally distributed and binary data, and carry out an analysis of real SAT Critical Reading data. The proposed test contributes to the framework of psychometric research, and it is available on the Comprehensive R Archive Network (CRAN) and in a more user-friendly Shiny App, both illustrated at the end of the paper.

Remote sensing data are increasingly available and frequently used to produce forest attribute maps. The sampling strategy of the calibration plots may directly affect predictions and map quality. The aim of this manuscript is to evaluate model transferability at different spatial scales according to the sampling effort and the calibration domain of these models. Forest inventory plots from local and regional networks were used to calibrate randomForest (RF) models for stand basal area predictions. Auxiliary data from ALS flights and a Sentinel-2 image were used. Model transferability was assessed by comparing models developed over a given area and applied elsewhere. Performance was measured in terms of precision (RMSE and bias), coefficient of determination ($R^2$) and the proportion of extrapolated predictions. Regional networks were also thinned to evaluate the effect of sampling effort on model performance. Local models showed large bias and extrapolation issues when applied elsewhere. Local issues of regional models were also observed, raising transferability and extrapolation concerns. An increase in sampling effort was shown to reduce extrapolation issues. The results of this study underline the importance of considering the models' validity domain when producing forest attribute maps, since their transferability is of crucial importance from a forest management perspective.

Bayesian priors are investigated for detecting targets of known spectral signature (but unknown strength) in cluttered backgrounds. A specific problem is the construction (or ``sculpting'') of a Bayesian prior that uniformly outperforms its non-Bayesian counterpart, the nominally sub-optimal but widely used Generalized Likelihood Ratio Test (GLRT).

Polynomial neural networks have been implemented in a range of applications and present an advantageous framework for theoretical machine learning. A polynomial neural network of fixed architecture and activation degree gives an algebraic map from the network's weights to a set of polynomials. The image of this map is the space of functions representable by the network. Its Zariski closure is an affine variety known as a neurovariety. The dimension of a polynomial neural network's neurovariety provides a measure of its expressivity. In this work, we introduce the notion of the activation threshold of a network architecture which expresses when the dimension of a neurovariety achieves its theoretical maximum. In addition, we prove expressiveness results for polynomial neural networks with equi-width architectures.

The dynamics of information diffusion in complex networks is widely studied in an attempt to understand how individuals communicate and how information travels and reaches individuals through interactions. However, complex networks often present community structure, and tools to analyse information diffusion on networks with communities are needed. In this paper, we develop theoretical tools using multi-type branching processes to model and analyse simple contagion information spread on a broad class of networks with community structure. We show how, by using limited information about the network -- the degree distribution within and between communities -- we can calculate standard statistical characteristics of the dynamics of information diffusion, such as the extinction probability, hazard function, and cascade size distribution. These properties can be estimated not only for the entire network but also for each community separately. Furthermore, we estimate the probability of information spreading from one community to another where it is not currently spreading. We demonstrate the accuracy of our framework by applying it to two specific examples: the Stochastic Block Model and a log-normal network with community structure. We show how the initial seeding location affects the observed cascade size distribution on a heavy-tailed network and that our framework accurately captures this effect.
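
The extinction probability of a multi-type branching process can be computed by fixed-point iteration on the offspring generating functions, as sketched below. Two things here are assumptions made for illustration rather than the paper's model: each individual in community `i` is taken to produce a Poisson number of offspring in community `j` with mean `mean_matrix[i][j]`, and the function name `extinction_probabilities` is hypothetical.

```python
import math


def extinction_probabilities(mean_matrix, tol=1e-12, max_iter=100000):
    """Per-type extinction probabilities for Poisson offspring counts.

    With independent Poisson(m_ij) offspring, the generating function is
    G_i(s) = exp(sum_j m_ij * (s_j - 1)). Iterating q <- G(q) from q = 0
    converges to the smallest non-negative fixed point, which is the
    extinction probability of a cascade seeded in each community.
    """
    k = len(mean_matrix)
    q = [0.0] * k
    for _ in range(max_iter):
        new_q = [
            math.exp(sum(mean_matrix[i][j] * (q[j] - 1.0) for j in range(k)))
            for i in range(k)
        ]
        if max(abs(a - b) for a, b in zip(new_q, q)) < tol:
            return new_q
        q = new_q
    return q


# Hypothetical two-community example: community 0 spreads readily,
# community 1 is subcritical on its own but can be reached from community 0.
mean_matrix = [[1.5, 0.2],
               [0.1, 0.6]]
q = extinction_probabilities(mean_matrix)
single = extinction_probabilities([[2.0]])[0]
print(q, single)
```

In this example a cascade seeded in community 1 is more likely to die out than one seeded in community 0, a small-scale version of the dependence on seeding location that the abstract describes.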

In many machine learning for healthcare tasks, standard datasets are constructed by amassing data across many, often fundamentally dissimilar, sources. But when does adding more data help, and when does it hinder progress on desired model outcomes in real-world settings? We identify this situation as the \textit{Data Addition Dilemma}, demonstrating that adding training data in this multi-source scaling context can at times result in reduced overall accuracy, uncertain fairness outcomes, and reduced worst-subgroup performance. We find that this possibly arises from an empirically observed trade-off between model performance improvements due to data scaling and model deterioration from distribution shift. We thus establish baseline strategies for navigating this dilemma, introducing distribution shift heuristics to guide decision-making on which data sources to add in data scaling, in order to yield the expected model performance improvements. We conclude with a discussion of the required considerations for data collection and suggestions for studying data composition and scale in the age of increasingly large models.

Spectral techniques are popular and robust approaches to data analysis. A prominent example is the use of eigenvectors of a Laplacian, constructed from data affinities, to identify natural data groupings or clusters, or to produce a simplified representation of data lying on a manifold. This tutorial concerns the dynamic Laplacian, which is a natural generalisation of the Laplacian to handle data that has a time component and lies on a time-evolving manifold. In this dynamic setting, clusters correspond to long-lived ``coherent'' collections. We begin with a gentle recap of spectral geometry before describing the dynamic generalisations. We also discuss computational methods and the automatic separation of many distinct features through the SEBA algorithm. The purpose of this tutorial is to bring together many results from the dynamic Laplacian literature into a single short document, written in an accessible style.