Methodology

Date: Thu, 9 May 2024 | Total: 13

#1 Causality Pursuit from Heterogeneous Environments via Neural Adversarial Invariance Learning [PDF] [Copy] [Kimi2]

Authors: Yihong Gu ; Cong Fang ; Peter Bühlmann ; Jianqing Fan

Statistics suffers from a fundamental problem, "the curse of endogeneity" -- the regression function, or more broadly the prediction risk minimizer with infinite data, may not be the target we wish to pursue. This is because when complex data are collected from multiple sources, the biases deviated from the interested (causal) association inherited in individuals or sub-populations are not expected to be canceled. Traditional remedies are of hindsight and restrictive in being tailored to prior knowledge like untestable cause-effect structures, resulting in methods that risk model misspecification and lack scalable applicability. This paper seeks to offer a purely data-driven and universally applicable method that only uses the heterogeneity of the biases in the data rather than following pre-offered commandments. Such an idea is formulated as a nonparametric invariance pursuit problem, whose goal is to unveil the invariant conditional expectation $m^\star(x)\equiv \mathbb{E}[Y^{(e)}|X_{S^\star}^{(e)}=x_{S^\star}]$ with unknown important variable set $S^\star$ across heterogeneous environments $e\in \mathcal{E}$. Under the structural causal model framework, $m^\star$ can be interpreted as certain data-driven causality in general. The paper contributes to proposing a novel framework, called Focused Adversarial Invariance Regularization (FAIR), formulated as a single minimax optimization program that can solve the general invariance pursuit problem. As illustrated by the unified non-asymptotic analysis, our adversarial estimation framework can attain provable sample-efficient estimation akin to standard regression under a minimal identification condition for various tasks and models. As an application, the FAIR-NN estimator realized by two Neural Network classes is highlighted as the first approach to attain statistically efficient estimation in general nonparametric invariance learning.

#2 Community detection in multi-layer bipartite networks [PDF] [Copy] [Kimi]

Author: Huan Qing

The problem of community detection in multi-layer undirected networks has received considerable attention in recent years. However, practical scenarios often involve multi-layer bipartite networks, where each layer consists of two distinct types of nodes. Existing community detection algorithms tailored for multi-layer undirected networks are not directly applicable to multi-layer bipartite networks. To address this challenge, this paper introduces a novel multi-layer degree-corrected stochastic co-block model specifically designed to capture the underlying community structure within multi-layer bipartite networks. Within this framework, we propose an efficient debiased spectral co-clustering algorithm for detecting nodes' communities. We establish the consistent estimation property of our proposed algorithm and demonstrate that an increased number of layers in bipartite networks improves the accuracy of community detection. Through extensive numerical experiments, we showcase the superior performance of our algorithm compared to existing methods. Additionally, we validate our algorithm by applying it to real-world multi-layer network datasets, yielding meaningful and insightful results.

#3 Fast Computation of Leave-One-Out Cross-Validation for $k$-NN Regression [PDF] [Copy] [Kimi]

Author: Motonobu Kanagawa

We describe a fast computation method for leave-one-out cross-validation (LOOCV) for $k$-nearest neighbours ($k$-NN) regression. We show that, under a tie-breaking condition for nearest neighbours, the LOOCV estimate of the mean square error for $k$-NN regression is identical to the mean square error of $(k+1)$-NN regression evaluated on the training data, multiplied by the scaling factor $(k+1)^2/k^2$. Therefore, to compute the LOOCV score, one only needs to fit $(k+1)$-NN regression only once, and does not need to repeat training-validation of $k$-NN regression for the number of training data. Numerical experiments confirm the validity of the fast computation method.

#4 Adaptive design of experiments methodology for noise resistance with unreplicated experiments [PDF] [Copy] [Kimi]

Authors: Lucas Caparini ; Gwynn J. Elfring ; Mauricio Ponga

A new gradient-based adaptive sampling method is proposed for design of experiments applications which balances space filling, local refinement, and error minimization objectives while reducing reliance on delicate tuning parameters. High order local maximum entropy approximants are used for metamodelling, which take advantage of boundary-corrected kernel density estimation to increase accuracy and robustness on highly clumped datasets, as well as conferring the resulting metamodel with some robustness against data noise in the common case of unreplicated experiments. Two-dimensional test cases are analyzed against full factorial and latin hypercube designs and compare favourably. The proposed method is then applied in a unique manner to the problem of adaptive spatial resolution in time-varying non-linear functions, opening up the possibility to adapt the method to solve partial differential equations.

#5 Inference With Combining Rules From Multiple Differentially Private Synthetic Datasets [PDF] [Copy] [Kimi]

Authors: Leila Nombo ; Anne-Sophie Charest

Differential privacy (DP) has been accepted as a rigorous criterion for measuring the privacy protection offered by random mechanisms used to obtain statistics or, as we will study here, synthetic datasets from confidential data. Methods to generate such datasets are increasingly numerous, using varied tools including Bayesian models, deep neural networks and copulas. However, little is still known about how to properly perform statistical inference with these differentially private synthetic (DIPS) datasets. The challenge is for the analyses to take into account the variability from the synthetic data generation in addition to the usual sampling variability. A similar challenge also occurs when missing data is imputed before analysis, and statisticians have developed appropriate inference procedures for this case, which we tend extended to the case of synthetic datasets for privacy. In this work, we study the applicability of these procedures, based on combining rules, to the analysis of DIPS datasets. Our empirical experiments show that the proposed combining rules may offer accurate inference in certain contexts, but not in all cases.

#6 Weighted Particle-Based Optimization for Efficient Generalized Posterior Calibration [PDF] [Copy] [Kimi]

Author: Masahiro Tanaka

In the realm of statistical learning, the increasing volume of accessible data and increasing model complexity necessitate robust methodologies. This paper explores two branches of robust Bayesian methods in response to this trend. The first is generalized Bayesian inference, which introduces a learning rate parameter to enhance robustness against model misspecifications. The second is Gibbs posterior inference, which formulates inferential problems using generic loss functions rather than probabilistic models. In such approaches, it is necessary to calibrate the spread of the posterior distribution by selecting a learning rate parameter. The study aims to enhance the generalized posterior calibration (GPC) algorithm proposed by Syring and Martin (2019) [Biometrika, Volume 106, Issue 2, pp. 479-486]. Their algorithm chooses the learning rate to achieve the nominal frequentist coverage probability, but it is computationally intensive because it requires repeated posterior simulations for bootstrap samples. We propose a more efficient version of the GPC inspired by sequential Monte Carlo (SMC) samplers. A target distribution with a different learning rate is evaluated without posterior simulation as in the reweighting step in SMC sampling. Thus, the proposed algorithm can reach the desired value within a few iterations. This improvement substantially reduces the computational cost of the GPC. Its efficacy is demonstrated through synthetic and real data applications.

#7 On Correlation and Prediction Interval Reduction [PDF] [Copy] [Kimi]

Authors: Romain Piaget-Rossel ; Valentin Rousson

Pearson's correlation coefficient is a popular statistical measure to summarize the strength of association between two continuous variables. It is usually interpreted via its square as percentage of variance of one variable predicted by the other in a linear regression model. It can be generalized for multiple regression via the coefficient of determination, which is not straightforward to interpret in terms of prediction accuracy. In this paper, we propose to assess the prediction accuracy of a linear model via the prediction interval reduction (PIR) by comparing the width of the prediction interval derived from this model with the width of the prediction interval obtained without this model. At the population level, PIR is one-to-one related to the correlation and the coefficient of determination. In particular, a correlation of 0.5 corresponds to a PIR of only 13%. It is also the one's complement of the coefficient of alienation introduced at the beginning of last century. We argue that PIR is easily interpretable and useful to keep in mind how difficult it is to make accurate individual predictions, an important message in the era of precision medicine and artificial intelligence. Different estimates of PIR are compared in the context of a linear model and an extension of the PIR concept to non-linear models is outlined.

#8 Dependence-based fuzzy clustering of functional time series [PDF] [Copy] [Kimi]

Authors: Angel Lopez-Oriona ; Ying Sun ; Han Lin Shang

Time series clustering is an important data mining task with a wide variety of applications. While most methods focus on time series taking values on the real line, very few works consider functional time series. However, functional objects frequently arise in many fields, such as actuarial science, demography or finance. Functional time series are indexed collections of infinite-dimensional curves viewed as random elements taking values in a Hilbert space. In this paper, the problem of clustering functional time series is addressed. To this aim, a distance between functional time series is introduced and used to construct a clustering procedure. The metric relies on a measure of serial dependence which can be seen as a natural extension of the classical quantile autocorrelation function to the functional setting. Since the dynamics of the series may vary over time, we adopt a fuzzy approach, which enables the procedure to locate each series into several clusters with different membership degrees. The resulting algorithm can group series generated from similar stochastic processes, reaching accurate results with series coming from a broad variety of functional models and requiring minimum hyperparameter tuning. Several simulation experiments show that the method exhibits a high clustering accuracy besides being computationally efficient. Two interesting applications involving high-frequency financial time series and age-specific mortality improvement rates illustrate the potential of the proposed approach.

#9 Guiding adaptive shrinkage by co-data to improve regression-based prediction and feature selection [PDF] [Copy] [Kimi]

Authors: Mark A. van de Wiel ; Wessel N. van Wieringen

The high dimensional nature of genomics data complicates feature selection, in particular in low sample size studies - not uncommon in clinical prediction settings. It is widely recognized that complementary data on the features, `co-data', may improve results. Examples are prior feature groups or p-values from a related study. Such co-data are ubiquitous in genomics settings due to the availability of public repositories. Yet, the uptake of learning methods that structurally use such co-data is limited. We review guided adaptive shrinkage methods: a class of regression-based learners that use co-data to adapt the shrinkage parameters, crucial for the performance of those learners. We discuss technical aspects, but also the applicability in terms of types of co-data that can be handled. This class of methods is contrasted with several others. In particular, group-adaptive shrinkage is compared with the better-known sparse group-lasso by evaluating feature selection. Finally, we demonstrate the versatility of the guided shrinkage methodology by showing how to `do-it-yourself': we integrate implementations of a co-data learner and the spike-and-slab prior for the purpose of improving feature selection in genetics studies.

#10 A joint model for DHS and MICS surveys: Spatial modeling with anonymized locations [PDF] [Copy] [Kimi]

Authors: John Paige ; Geir-Arne Fuglstad ; Andrea Riebler

Anonymizing the GPS locations of observations can bias a spatial model's parameter estimates and attenuate spatial predictions when improperly accounted for, and is relevant in applications from public health to paleoseismology. In this work, we demonstrate that a newly introduced method for geostatistical modeling in the presence of anonymized point locations can be extended to account for more general kinds of positional uncertainty due to location anonymization, including both jittering (a form of random perturbations of GPS coordinates) and geomasking (reporting only the name of the area containing the true GPS coordinates). We further provide a numerical integration scheme that flexibly accounts for the positional uncertainty as well as spatial and covariate information. We apply the method to women's secondary education completion data in the 2018 Nigeria demographic and health survey (NDHS) containing jittered point locations, and the 2016 Nigeria multiple indicator cluster survey (NMICS) containing geomasked locations. We show that accounting for the positional uncertainty in the surveys can improve predictions in terms of their continuous rank probability score.

#11 Combining Rollout Designs and Clustering for Causal Inference under Low-order Interference [PDF] [Copy] [Kimi]

Authors: Mayleen Cortez-Rodriguez ; Matthew Eichhorn ; Christina Lee Yu

Estimating causal effects under interference is pertinent to many real-world settings. However, the true interference network may be unknown to the practitioner, precluding many existing techniques that leverage this information. A recent line of work with low-order potential outcomes models uses staggered rollout designs to obtain unbiased estimators that require no network information. However, their use of polynomial extrapolation can lead to prohibitively high variance. To address this, we propose a two-stage experimental design that restricts treatment rollout to a sub-population. We analyze the bias and variance of an interpolation-style estimator under this experimental design. Through numerical simulations, we explore the trade-off between the error attributable to the subsampling of our experimental design and the extrapolation of the estimator. Under low-order interactions models with degree greater than 1, the proposed design greatly reduces the error of the polynomial interpolation estimator, such that it outperforms baseline estimators, especially when the treatment probability is small.

#12 A goodness-of-fit diagnostic for count data derived from half-normal plots with a simulated envelope [PDF] [Copy] [Kimi]

Authors: Darshana Jayakumari ; Jochen Einbeck ; John Hinde ; Julien Mainguy ; Rafael de Andrade Moral

Traditional methods of model diagnostics may include a plethora of graphical techniques based on residual analysis, as well as formal tests (e.g. Shapiro-Wilk test for normality and Bartlett test for homogeneity of variance). In this paper we derive a new distance metric based on the half-normal plot with a simulation envelope, a graphical model evaluation method, and investigate its properties through simulation studies. The proposed metric can help to assess the fit of a given model, and also act as a model selection criterion by being comparable across models, whether based or not on a true likelihood. More specifically, it quantitatively encompasses the model evaluation principles and removes the subjective bias when closely related models are involved. We validate the technique by means of an extensive simulation study carried out using count data, and illustrate with two case studies in ecology and fisheries research.

#13 Multivariate group sequential tests for global summary statistics [PDF] [Copy] [Kimi]

Authors: Abigail J. Burdon ; Thomas Jaki

We describe group sequential tests which efficiently incorporate information from multiple endpoints allowing for early stopping at pre-planned interim analyses. We formulate a testing procedure where several outcomes are examined, and interim decisions are based on a global summary statistic. An error spending approach to this problem is defined which allows for unpredictable group sizes and nuisance parameters such as the correlation between endpoints. We present and compare three methods for implementation of the testing procedure including numerical integration, the Delta approximation and Monte Carlo simulation. In our evaluation, numerical integration techniques performed best for implementation with error rate calculations accurate to five decimal places. Our proposed testing method is flexible and accommodates summary statistics derived from general, non-linear functions of endpoints informed by the statistical model. Type 1 error rates are controlled, and sample size calculations can easily be performed to satisfy power requirements.