2025-04-07 | | Total: 50
Estimating causal effects from observational data is inherently challenging due to the lack of observable counterfactual outcomes and even the presence of unmeasured confounding. Traditional methods often rely on restrictive, untestable assumptions or necessitate valid instrumental variables, significantly limiting their applicability and robustness. In this paper, we introduce Augmented Causal Effect Estimation (ACEE), an innovative approach that utilizes synthetic data generated by a diffusion model to enhance causal effect estimation. By fine-tuning pre-trained generative models, ACEE simulates counterfactual scenarios that are otherwise unobservable, facilitating accurate estimation of individual and average treatment effects even under unmeasured confounding. Unlike conventional methods, ACEE relaxes the stringent unconfoundedness assumption, relying instead on an empirically checkable condition. Additionally, a bias-correction mechanism is introduced to mitigate synthetic data inaccuracies. We provide theoretical guarantees demonstrating the consistency and efficiency of the ACEE estimator, alongside comprehensive empirical validation through simulation studies and benchmark datasets. Results confirm that ACEE significantly improves causal estimation accuracy, particularly in complex settings characterized by nonlinear relationships and heteroscedastic noise.
This paper reviews and details methods for state policy evaluation to guide selection of a research approach based on evaluation setting and available data. We highlight key design considerations for an analysis, including treatment and control group selection, timing of policy adoption, expected effect heterogeneity, and data considerations. We then provide an overview of analytic approaches and differentiate between methods based on evaluation context, such as settings with no control units, a single treated unit, multiple treated units, or with multiple treatment cohorts. Methods discussed include interrupted time series models, difference-in-differences estimators, autoregressive models, and synthetic control methods, along with method extensions which address issues like staggered policy adoption and heterogeneous treatment effects. We end with an illustrative example, applying the developed framework to evaluate the impacts of state-level naloxone standing order policies on overdose rates. Overall, we provide researchers with an approach for deciding on methods for state policy evaluations, which can be used to select study designs and inform methodological choices.
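As a rough illustration of the simplest difference-in-differences estimator mentioned in the abstract above, the following sketch computes the canonical two-group, two-period estimate from cell means. The function name `did_2x2` and the toy numbers are hypothetical and are not taken from the paper; staggered adoption and heterogeneous effects would need the extensions the paper reviews.

```python
import numpy as np

def did_2x2(y, treated, post):
    """Two-group, two-period difference-in-differences from cell means."""
    y = np.asarray(y, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    post = np.asarray(post, dtype=bool)

    def cell_mean(group, period):
        # Mean outcome for one (group, period) cell
        return y[(treated == group) & (post == period)].mean()

    change_treated = cell_mean(True, True) - cell_mean(True, False)
    change_control = cell_mean(False, True) - cell_mean(False, False)
    # The control group's change stands in for the treated group's
    # counterfactual trend (parallel-trends assumption).
    return change_treated - change_control

# Hypothetical toy outcomes: two control and two treated units, pre and post
y = [10.0, 12.0, 13.0, 15.0, 20.0, 22.0, 27.0, 29.0]
treated = [0, 0, 0, 0, 1, 1, 1, 1]
post = [0, 0, 1, 1, 0, 0, 1, 1]
effect = did_2x2(y, treated, post)
```

Here the treated group changes by 7 and the control group by 3, so the estimate is their difference.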
This study investigates the relationship between environmental sustainability policies and tourism flows across Italian provinces using a Spatial Durbin Error Model (SDEM) within a gravity framework. By incorporating both public and corporate environmental initiatives, the analysis highlights the direct and spatial spillover effects of sustainability measures on tourism demand. The findings indicate that corporate-led initiatives, such as ecocertifications and green investments, exert a stronger direct influence on tourism flows compared to public measures, underscoring the visibility and immediate impact of private sector actions. However, both types of initiatives generate significant positive spatial spillovers, suggesting that sustainability efforts extend beyond local boundaries. These results demonstrate the interconnected nature of regional tourism systems and emphasize the critical role of coordinated sustainability policies in fostering tourism growth while promoting environmental protection. By addressing the spatial interdependencies of tourism flows and sustainability practices, this research provides valuable insights for policymakers and stakeholders seeking to improve sustainable tourism development at regional and national levels.
Critical bandwidth (CB) is used to test the multimodality of densities and regression functions, as well as for clustering methods. CB tests are known to be inconsistent if the function of interest is constant ("flat") over even a small interval, and to suffer from low power and incorrect size in finite samples if the function has a relatively small derivative over an interval. This paper proposes a solution, flatness-robust CB (FRCB), that exploits the novel observation that the inconsistency manifests only from regions consistent with the null hypothesis, and thus identifying and excluding them does not alter the null or alternative sets. I provide sufficient conditions for consistency of FRCB, and simulations of a test of regression monotonicity demonstrate the finite-sample properties of FRCB compared with CB for various regression functions. Surprisingly, FRCB performs better than CB in some cases where there are no flat regions, which can be explained by FRCB essentially giving more importance to parts of the function where there are larger violations of the null hypothesis. I illustrate the usefulness of FRCB with an empirical analysis of the monotonicity of the conditional mean function of radiocarbon age with respect to calendar age.
For any forecasting application, evaluation of forecasts is an important task. For example, in the field of renewable energy sources there is high variability and uncertainty of power production, which makes forecasting and the evaluation hereof crucial both for power trading and power grid balancing. In particular, probabilistic forecasts represented by ensembles are popular due to their ability to cover the full range of scenarios that can occur, thus enabling forecast users to make more informed decisions than what would be possible with simple deterministic forecasts. The selection of open source software that supports evaluation of ensemble forecasts, and especially event detection, is currently limited. As a solution, evalprob4cast is a new R-package for probabilistic forecast evaluation that aims to provide its users with all the tools needed for the assessment of ensemble forecasts, in the form of metrics and visualization methods. Both univariate and multivariate probabilistic forecasts as well as event detection are covered. Furthermore, it offers a user-friendly design where all of the evaluation methods can be applied in a fast and easy way, as long as the input data is organized in accordance with the format defined by the package. While its development is motivated by forecasting of renewables, the package can be used for any application with ensemble forecasts.
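One common score for ensemble forecasts of the kind described in the abstract above is the continuous ranked probability score (CRPS). The sketch below uses the standard empirical estimator for an ensemble; it is written in Python for illustration and is not the evalprob4cast interface, whose functions and data format are not described in the abstract. The name `crps_ensemble` is an assumption.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS of one ensemble forecast against one observation.

    Uses CRPS = E|X - y| - 0.5 * E|X - X'|, with both expectations
    replaced by averages over the ensemble members.
    """
    members = np.asarray(members, dtype=float)
    # Average absolute error of the members against the observation
    accuracy = np.mean(np.abs(members - obs))
    # Average absolute difference between all pairs of members
    spread = np.mean(np.abs(members[:, None] - members[None, :]))
    return accuracy - 0.5 * spread

score = crps_ensemble([1.0, 3.0], 2.0)
```

Lower values are better, and for a single-member ensemble the score reduces to the absolute error.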
Operator learning has emerged as a powerful tool in scientific computing for approximating mappings between infinite-dimensional function spaces. A primary application of operator learning is the development of surrogate models for the solution operators of partial differential equations (PDEs). These methods can also be used to develop black-box simulators to model system behavior from experimental data, even without a known mathematical model. In this article, we begin by formalizing operator learning as a function-to-function regression problem and review some recent developments in the field. We also discuss PDE-specific operator learning, outlining strategies for incorporating physical and mathematical constraints into architecture design and training processes. Finally, we end by highlighting key future directions such as active data collection and the development of rigorous uncertainty quantification frameworks.
Motivated by a study on deception and counter-deception, this paper addresses the problem of identifying an agent's target as it seeks to reach one of two targets in a given environment. In practice, an agent may initially follow a strategy to aim at one target but decide to switch to another midway. Such a strategy can be deceptive when the counterpart only has access to imperfect observations corrupted by heavy sensor noise and possible outliers, making it difficult to visually identify the agent's true intent. To counter deception and identify the true target, we utilize prior knowledge of the agent's dynamics and the imprecisely observed partial trajectory of the agent's states to dynamically update the estimate of the posterior probability that a deceptive switch has taken place. However, existing methods in the literature have not achieved effective deception identification within a reasonable computation time. We propose a set of outlier-robust change detection methods to track relevant change-related statistics efficiently, enabling the detection of deceptive strategies in hidden nonlinear dynamics with reasonable computational effort. The performance of the proposed framework is examined for Weapon-Target Assignment (WTA) detection under deceptive strategies using random simulations in the kinematics model with external forcing.
The impact of wildfire smoke on air quality is a growing concern, contributing to air pollution through a complex mixture of chemical species with important implications for public health. While previous studies have primarily focused on its association with total particulate matter (PM2.5), the causal relationship between wildfire smoke and the chemical composition of PM2.5 remains largely unexplored. Exposure to these chemical mixtures plays a critical role in shaping public health, yet capturing their relationships requires advanced statistical methods capable of modeling the complex dependencies among chemical species. To fill this gap, we propose a Bayesian causal regression factor model that estimates the multivariate causal effects of wildfire smoke on the concentration of 27 chemical species in PM2.5 across the United States. Our approach introduces two key innovations: (i) a causal inference framework for multivariate potential outcomes, and (ii) a novel Bayesian factor model that employs a probit stick-breaking process as prior for treatment-specific factor scores. By focusing on factor scores, our method addresses the missing data challenge common in causal inference and enables a flexible, data-driven characterization of the latent factor structure, which is crucial to capture the complex correlation among multivariate outcomes. Through Monte Carlo simulations, we show the model's accuracy in estimating the causal effects in multivariate outcomes and characterizing the treatment-specific latent structure. Finally, we apply our method to US air quality data, estimating the causal effect of wildfire smoke on 27 chemical species in PM2.5, providing a deeper understanding of their interdependencies.
We consider a classical First-order Vector AutoRegressive (VAR(1)) model, where we interpret the autoregressive interaction matrix as influence relationships among the components of the VAR(1) process that can be encoded by a weighted directed graph. A majority of previous work studies the structural identifiability of the graph based on time series observations and therefore relies on dynamical information. In this work we assume that an equilibrium exists, and study instead the identifiability of the graph from the stationary distribution, meaning that we seek a way to reconstruct the influence graph underlying the dynamic network using only static information. We use an approach from algebraic statistics that characterizes models using the Jacobian matroids associated with the parametrization of the models, and we introduce sufficient graphical conditions under which different graphs yield distinct steady-state distributions. Additionally, we illustrate how our results could be applied to characterize networks inspired by ecological research.
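To make the "static information" in the abstract above concrete, the sketch below computes the stationary covariance of a stable VAR(1) process X_{t+1} = A X_t + e_t, which satisfies Sigma = A Sigma A^T + Q. The noise covariance `Q`, the Gaussian-noise reading, and the function name `stationary_covariance` are assumptions made for illustration; the paper's identifiability analysis itself is not reproduced here.

```python
import numpy as np

def stationary_covariance(A, Q):
    """Solve Sigma = A Sigma A^T + Q for a stable VAR(1) interaction matrix A."""
    A = np.asarray(A, dtype=float)
    Q = np.asarray(Q, dtype=float)
    d = A.shape[0]
    # An equilibrium exists only if the spectral radius of A is below one
    if np.max(np.abs(np.linalg.eigvals(A))) >= 1.0:
        raise ValueError("A must have spectral radius < 1")
    # Vectorised form: (I - A kron A) vec(Sigma) = vec(Q)
    lhs = np.eye(d * d) - np.kron(A, A)
    vec_sigma = np.linalg.solve(lhs, Q.reshape(-1))
    return vec_sigma.reshape(d, d)

# Hypothetical two-node graph with one directed edge (node 1 influences node 2)
A = np.array([[0.5, 0.0], [0.2, 0.3]])
Sigma = stationary_covariance(A, np.eye(2))
```

The identifiability question is then whether different edge patterns in `A` can produce the same `Sigma`.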
Micro-level data with granular spatial and temporal information are becoming increasingly available to social scientists. Most researchers aggregate such data into a convenient panel data format and apply standard causal inference methods. This approach, however, has two limitations. First, data aggregation results in the loss of detailed geo-location and temporal information, leading to potential biases. Second, most panel data methods either ignore spatial spillover and temporal carryover effects or impose restrictive assumptions on their structure. We introduce a general methodological framework for spatiotemporal causal inference with arbitrary spillover and carryover effects. Under this general framework, we demonstrate how to define and estimate causal quantities of interest, explore heterogeneous treatment effects, investigate causal mechanisms, and visualize the results to facilitate their interpretation. We illustrate the proposed methodology through an analysis of airstrikes and insurgent attacks in Iraq. The open-source software package geocausal implements all of our methods.
In stochastic optimal control and conditional generative modelling, a central computational task is to modify a reference diffusion process to maximise a given terminal-time reward. Most existing methods require this reward to be differentiable, using gradients to steer the diffusion towards favourable outcomes. However, in many practical settings, like diffusion bridges, the reward is singular, taking an infinite value if the target is hit and zero otherwise. We introduce a novel framework, based on Malliavin calculus and path-space integration by parts, that enables the development of methods robust to such singular rewards. This allows our approach to handle a broad range of applications, including classification, diffusion bridges, and conditioning without the need for artificial observational noise. We demonstrate that our approach offers stable and reliable training, outperforming existing techniques.
Given i.i.d. observations uniformly distributed on a closed submanifold of the Euclidean space, we study higher-order generalizations of graph Laplacians, so-called Hodge Laplacians on graphs, as approximations of the Laplace-Beltrami operator on differential forms. Our main result is a high-probability error bound for the associated Dirichlet forms. This bound improves existing Dirichlet form error bounds for graph Laplacians in the context of Laplacian Eigenmaps, and it provides insights into the Betti numbers studied in topological data analysis and the complementing positive part of the spectrum.
Nonparametric regression with random design is considered. The L2 error with integration with respect to the design measure is used as the error criterion. An over-parametrized deep neural network regression estimate with logistic activation function is defined, where all weights are learned by gradient descent. It is shown that the estimate achieves a nearly optimal rate of convergence when the regression function is (p,C)-smooth.
A new formula for Marchenko-Pastur inversion is derived and used for inference of population linear spectral statistics. The formula allows for estimation of the Stieltjes transform of the population spectral distribution s_H(z), when z is sufficiently far from the support of the population spectral distribution H. If the dimension d and the sample size n go to infinity simultaneously such that d/n → c > 0, the estimation error is shown to be asymptotically less than n^ε/n for arbitrary ε > 0. By integrating along a curve around the support of H, estimators for population linear spectral statistics are constructed, which benefit from this convergence speed of n^ε/n.
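For orientation, the Stieltjes transform of a spectral distribution F is s_F(z) = ∫ 1/(λ - z) dF(λ). The sketch below evaluates it for the empirical spectral distribution of a sample covariance matrix, which is the observable quantity that a Marchenko-Pastur inversion starts from. It does not implement the paper's inversion formula, and the function name `empirical_stieltjes` is hypothetical.

```python
import numpy as np

def empirical_stieltjes(sample_cov, z):
    """Stieltjes transform of the empirical spectral distribution at z.

    Computes (1/d) * sum_i 1 / (lambda_i - z) over the eigenvalues of the
    symmetric matrix sample_cov; z should lie off the spectrum.
    """
    eigvals = np.linalg.eigvalsh(np.asarray(sample_cov, dtype=float))
    return np.mean(1.0 / (eigvals - z))

# Identity covariance: every eigenvalue equals 1
value = empirical_stieltjes(np.eye(3), 3.0)
```

Complex arguments such as `1 + 1j` work as well, since the eigenvalues are real and only `z` carries the imaginary part.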
We propose efficient computational methods to fit multivariate Gaussian additive models, where the mean vector and the covariance matrix are allowed to vary with covariates, in an empirical Bayes framework. To guarantee the positive-definiteness of the covariance matrix, we model the elements of an unconstrained parametrisation matrix, focussing particularly on the modified Cholesky decomposition and the matrix logarithm. A key computational challenge arises from the fact that, for the model class considered here, the number of parameters increases quadratically with the dimension of the response vector. Hence, here we discuss how to achieve fast computation and low memory footprint in moderately high dimensions, by exploiting parsimonious model structures, sparse derivative systems and by employing block-oriented computational methods. Methods for building and fitting multivariate Gaussian additive models are provided by the SCM R package, available at https://github.com/VinGioia90/SCM, while the code for reproducing the results in this paper is available at https://github.com/VinGioia90/SACM.
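The role of the unconstrained parametrisation in the abstract above can be sketched with the modified Cholesky decomposition, in which T Sigma T^T = D^2 for a unit lower-triangular T and a positive diagonal D^2. Any real-valued inputs then map to a positive-definite covariance matrix. The sketch is in Python for illustration; the function name `mcd_to_covariance` and the ordering of the parameters are assumptions and may differ from what the SCM R package uses.

```python
import numpy as np

def mcd_to_covariance(log_d2, t_lower):
    """Map unconstrained parameters to a covariance matrix via T Sigma T^T = D^2.

    log_d2  : length-d vector, logs of the diagonal of D^2
    t_lower : length d*(d-1)/2 vector, strictly lower-triangular entries of T
    """
    log_d2 = np.asarray(log_d2, dtype=float)
    d = log_d2.shape[0]
    T = np.eye(d)
    # Fill the strictly lower triangle; the unit diagonal keeps T invertible
    T[np.tril_indices(d, k=-1)] = np.asarray(t_lower, dtype=float)
    T_inv = np.linalg.inv(T)
    # Sigma = T^{-1} D^2 T^{-T}; exp() keeps the diagonal of D^2 positive
    return T_inv @ np.diag(np.exp(log_d2)) @ T_inv.T

rng = np.random.default_rng(0)
d = 4
Sigma = mcd_to_covariance(rng.normal(size=d), rng.normal(size=d * (d - 1) // 2))
```

The number of inputs is d + d(d-1)/2, which illustrates the quadratic growth in parameters that the abstract identifies as the main computational challenge.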
Improving energy efficiency of building heating systems is essential for reducing global energy consumption and greenhouse gas emissions. Traditional control methods in buildings rely on static heating curves based solely on outdoor temperature measurements, neglecting system state and free heat sources like solar gain. Model predictive control (MPC) not only addresses these limitations but further optimizes heating control by incorporating weather forecasts and system state predictions. However, current industrial MPC solutions often use simplified physics-inspired models, which compromise accuracy for interpretability. While purely data-driven models offer better predictive performance, they face challenges like overfitting and lack of transparency. To bridge this gap, we propose a Bayesian Long Short-Term Memory (LSTM) architecture for indoor temperature modeling. Our experiments across 100 real-world buildings demonstrate that the Bayesian LSTM outperforms an industrial physics-based model in predictive accuracy, enabling potential for improved energy efficiency and thermal comfort if deployed in heating MPC solutions. Over deterministic black-box approaches, the Bayesian framework provides additional advantages by improving generalization ability and allowing interpretation of predictions via uncertainty quantification. This work advances data-driven heating control by balancing predictive performance with the transparency and reliability required for real-world heating MPC applications.
Modeling and forecasting interval-valued time series (ITS) have attracted considerable attention due to their growing presence in various contexts. To the best of our knowledge, there have been no efforts to model large-scale ITS. In this paper, we propose a feature extraction procedure for large-scale ITS, which involves key steps such as auto-segmentation and clustering, and feature transfer learning. This procedure can be seamlessly integrated with any suitable prediction models for forecasting purposes. Specifically, we transform the automatic segmentation and clustering of ITS into the estimation of Toeplitz sparse precision matrices and assignment set. The majorization-minimization algorithm is employed to convert this highly non-convex optimization problem into two subproblems. We derive efficient dynamic programming and alternating direction method to solve these two subproblems alternately and establish their convergence properties. By employing the Joint Recurrence Plot (JRP) to image subsequence and assigning a class label to each cluster, an image dataset is constructed. Then, an appropriate neural network is chosen to train on this image dataset and used to extract features for the next step of forecasting. Real data applications demonstrate that the proposed method can effectively obtain invariant representations of the raw data and enhance forecasting performance.
Accurate tuning of hyperparameters is crucial to ensure that models can generalise effectively across different settings. In this paper, we present theoretical guarantees for hyperparameter selection using variational Bayes in the nonparametric regression model. We construct a variational approximation to a hierarchical Bayes procedure, and derive upper bounds for the contraction rate of the variational posterior in an abstract setting. The theory is applied to various Gaussian process priors and variational classes, resulting in minimax optimal rates. Our theoretical results are accompanied by numerical analysis on both synthetic and real-world data sets.
In recent years, the modeling and analysis of interval-valued time series have garnered significant attention in the fields of econometrics and statistics. However, the existing literature primarily focuses on regression tasks while neglecting classification aspects. In this paper, we propose an adaptive approach for interval-valued time series classification. Specifically, we represent interval-valued time series using convex combinations of upper and lower bounds of intervals and transform these representations into images based on point-valued time series imaging methods. We utilize a fine-grained image classification neural network to classify these images, to achieve the goal of classifying the original interval-valued time series. This proposed method is applicable to both univariate and multivariate interval-valued time series. On the optimization front, we treat the convex combination coefficients as learnable parameters similar to the parameters of the neural network and provide an efficient estimation method based on the alternating direction method of multipliers (ADMM). On the theoretical front, under specific conditions, we establish a margin-based multiclass generalization bound for generic CNNs composed of basic blocks involving convolution, pooling, and fully connected layers. Through simulation studies and real data applications, we validate the effectiveness of the proposed method and compare its performance against a wide range of point-valued time series classification methods.
In this paper, we present a novel feature extraction procedure to predict interval-valued time series by combining transfer learning and imaging approaches. Initially, we represent interval-valued time series using a bivariate point-valued time series, which serves as a representative form. We then transform each time series into images by employing various imaging approaches such as recurrence plot, Gramian angular summation/difference field, and Markov transition field, and construct an image dataset by treating each imaging method's output as a separate class. Based on this dataset, we train several candidates for a feature extraction network (FEN), specifically ResNet with varying layers. Then we choose the penultimate layer of the FEN to extract the most relevant features from the transformed images. We integrate the extracted features into conventional predictive models to formulate the corresponding prediction models. The proposed methods are evaluated based on the S&P 500 index and three data-generating processes (DGPs), and the experimental results demonstrate a notable improvement in prediction performance compared to existing methods.
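Of the imaging approaches named in the abstract above, the recurrence plot is the simplest to sketch: it marks which pairs of time points have nearby states. The example below treats each interval as a bivariate point (lower bound, upper bound), which is one possible reading of the bivariate representation; the threshold `eps`, the Euclidean distance, and the function name `recurrence_plot` are assumptions rather than the paper's exact choices.

```python
import numpy as np

def recurrence_plot(lower, upper, eps):
    """Binary recurrence plot of an interval-valued series.

    Each time point is the 2-D state (lower_t, upper_t); entry (i, j) is 1
    when the Euclidean distance between states i and j is at most eps.
    """
    states = np.column_stack([lower, upper]).astype(float)
    diffs = states[:, None, :] - states[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    return (dist <= eps).astype(np.uint8)

# Hypothetical series whose first and third intervals coincide
image = recurrence_plot([0.0, 1.0, 0.0], [1.0, 2.0, 1.0], eps=0.5)
```

The resulting square array can be fed to an image network in the same way as any single-channel image.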
In this paper, we propose an efficient importance sampling algorithm for rare event simulation under copula models. In the algorithm, the derived optimal probability measure is based on the criterion of minimizing the variance of the importance sampling estimator within a parametric exponential tilting family. Since the copula model is defined by its marginals and a copula function, and its moment-generating function is difficult to derive, we apply the transform likelihood ratio method to first identify an alternative exponential tilting family, after which we obtain simple and explicit expressions of equations. Then, the optimal alternative probability measure can be calculated under this transformed exponential tilting family. The proposed importance sampling framework is quite general and can be implemented for many classes of copula models, including some traditional parametric copula families and a class of semiparametric copulas called regular vine copulas, from which sampling is feasible. The theoretical results of the logarithmic efficiency and bounded relative error are proved for some commonly-used copula models under the case of simple rare events. Monte Carlo experiments compare the crude Monte Carlo estimator with the proposed importance-sampling-based estimators and show that the latter obtain substantial variance reductions.
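The idea of exponential tilting behind the abstract above can be illustrated in the simplest possible setting: estimating the tail probability P(X > c) for a standard normal X by sampling from N(theta, 1) and reweighting with the likelihood ratio. This is a one-dimensional toy, not the copula-based algorithm or the transform likelihood ratio method of the paper; the choice theta = c is a common heuristic assumed here rather than a variance-optimal tilt, and the function name `tail_prob_tilted` is hypothetical.

```python
import math
import numpy as np

def tail_prob_tilted(c, n, theta=None, seed=0):
    """Importance sampling estimate of P(X > c) for X ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    # Heuristic tilt (assumption): centre the proposal at the threshold
    theta = c if theta is None else theta
    x = rng.normal(loc=theta, scale=1.0, size=n)
    # Likelihood ratio of N(0, 1) to N(theta, 1), evaluated at each draw
    weights = np.exp(-theta * x + 0.5 * theta**2)
    return float(np.mean((x > c) * weights))

exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))  # closed-form normal tail
estimate = tail_prob_tilted(4.0, 100_000)
```

With crude Monte Carlo the same sample size would see only a handful of exceedances of c = 4, which is why tilting the sampling distribution toward the rare event reduces the variance.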
Bayesian optimization based on Gaussian process upper confidence bound (GP-UCB) has a theoretical guarantee for optimizing black-box functions. Black-box functions often have input uncertainty, but even in this case, GP-UCB can be extended to optimize evaluation measures called robustness measures. However, GP-UCB-based methods for robustness measures include a trade-off parameter β, which must be excessively large to achieve theoretical validity, just like the original GP-UCB. In this study, we propose a new method called randomized robustness measure GP-UCB (RRGP-UCB), which samples the trade-off parameter β from a probability distribution based on a chi-squared distribution and avoids explicitly specifying β. The expected value of β is not excessively large. Furthermore, we show that RRGP-UCB provides tight bounds on the expected value of regret based on the optimal solution and estimated solutions. Finally, we demonstrate the usefulness of the proposed method through numerical experiments.
As big data continues to grow, statistical inference for multivariate functional data (MFD) has become crucial. Although recent advancements have been made in testing the equality of mean functions, research on testing linear hypotheses for mean functions remains limited. Current methods primarily consist of permutation-based tests or asymptotic tests. However, permutation-based tests are known to be time-consuming, while asymptotic tests typically require larger sample sizes to maintain an accurate Type I error rate. This paper introduces three finite-sample tests that modify traditional MANOVA methods to tackle the general linear hypothesis testing problem for MFD. The test statistics rely on two symmetric, nonnegative-definite matrices, approximated by Wishart distributions, with degrees of freedom estimated via a U-statistics-based method. The proposed tests are affine-invariant, computationally more efficient than permutation-based tests, and better at controlling significance levels in small samples compared to asymptotic tests. A real-data example further showcases their practical utility.
In this work, we propose a novel particle-based variational inference (ParVI) method that accelerates the EVI-Im. Inspired by energy quadratization (EQ) and operator splitting techniques for gradient flows, our approach efficiently drives particles towards the target distribution. Unlike EVI-Im, which employs the implicit Euler method to solve variational-preserving particle dynamics for minimizing the KL divergence, derived using a "discretize-then-variational" approach, the proposed algorithm avoids repeated evaluation of inter-particle interaction terms, significantly reducing computational cost. The framework is also extensible to other gradient-based sampling techniques. Through several numerical experiments, we demonstrate that our method outperforms existing ParVI approaches in efficiency, robustness, and accuracy.
Sequential multiple assignment randomized trials mimic the actual treatment processes experienced by physicians and patients in clinical settings and inform the comparative effectiveness of dynamic treatment regimes. In such trials, patients go through multiple stages of treatment, and the treatment assignment is adapted over time based on individual patient characteristics such as disease status and treatment history. In this work, we develop and evaluate statistically valid interim monitoring approaches to allow for early termination of sequential multiple assignment randomized trials for efficacy targeting survival outcomes. We propose a weighted log-rank Chi-square statistic to account for overlapping treatment paths and quantify how the log-rank statistics at two different analysis points are correlated. Efficacy boundaries at multiple interim analyses can then be established using the Pocock, O'Brien-Fleming, and Lan-DeMets boundaries. We run extensive simulations to comparatively evaluate the operating characteristics (type I error and power) of our interim monitoring procedure based on the proposed statistic and another existing statistic. The methods are demonstrated via an analysis of a neuroblastoma dataset.