2024-10-29 | Total: 15
Multivariate spatial disease mapping has become a pivotal part of everyday practice in social epidemiology. Despite the existence of several specifications for the relation between different outcomes, there is still a need for a new strategy that focuses on comparing the spatial risk patterns of different subgroups of the population. This paper introduces a new approach for detecting differences in spatial risk patterns between different populations at risk, using suicide-related emergency calls to study suicide risks in the Valencian Community (Spain).
In this paper, an analysis of hourly air temperatures at 32 stations in four groups, UK highland (five stations), UK lowland (four stations), Italian highland (eleven stations), and Italian lowland (twelve stations), at various altitudes was conducted over the period from 2002 to 2021. The study aimed to examine the trends of each hour of the day in that period, over different averaging time windows (10-day, 30-day, and 60-day). The trends were computed using the Mann-Kendall trend test and Sen's slope estimator. The similarity of trends within and across the groups of stations was assessed using hierarchical clustering with the dynamic time warping technique. An additional analysis was conducted to show the correlation of trends among the groups of stations using the correlation distance matrix. Hierarchical clustering and distance correlation analysis show trend similarities and correlations, also indicating dissimilarities among different groups. Using 30-day averages, significant warming trends in specific months at the Italian stations are evident, especially in February, July, August, and December. The UK highland stations did not show statistically significant trends, but clear pattern similarities were found within the groups, especially in certain months. The ultimate goal of this paper is to provide insights into temperature dynamics and climate change characteristics on regional and diurnal scales.
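The two trend statistics named in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes an evenly spaced series, uses the normal approximation without a tie correction, and the function names `mann_kendall` and `sens_slope` are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def mann_kendall(series):
    """Return (S, Z, two-sided p) for the Mann-Kendall test (no tie correction)."""
    n = len(series)
    # S sums the signs of all later-minus-earlier differences.
    s = sum((later > earlier) - (later < earlier)
            for earlier, later in combinations(series, 2))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    # Two-sided p-value under the standard normal approximation.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return s, z, p


def sens_slope(series):
    """Median of all pairwise slopes, assuming unit spacing between observations."""
    n = len(series)
    slopes = [(series[j] - series[i]) / (j - i)
              for i in range(n) for j in range(i + 1, n)]
    return median(slopes)


# Illustrative data only: a steadily warming series of 20 values.
temps = [10.0 + 0.05 * k for k in range(20)]
s_stat, z_stat, p_value = mann_kendall(temps)
slope = sens_slope(temps)
```

In a setting like the one described, the series passed in would be one hour of the day averaged over a chosen window, so each hour and window gets its own test and slope.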
Research on modeling the distributional aspects in sensor-based digital health (sDHT) data has grown significantly in recent years. Most existing approaches focus on using individual-specific density or quantile functions. However, there has been limited exploration to assess the practical utility of alternative distributional representations in clinical contexts collecting sDHT data. This study is motivated by accelerometry data collected on 246 individuals with multiple sclerosis (MS) representing a wide range of disability (Expanded Disability Status Scale, EDSS: 0-7). We consider five different individual-level distributional representations of minute-level activity counts: density, survival, hazard, quantile, and total time on test functions. For each of the five distributional representations, scalar-on-function regression fits linear discriminators for binary and continuously measured MS disability, and cross-validated discriminatory performance of these linear discriminators is compared across representations. The results show that individual-level hazard functions provide the highest discriminatory accuracy, more than double the accuracy compared to density functions. Individual-level quantile functions provide the second-highest discriminatory accuracy. These findings highlight the importance of focusing on distributional representations that capture the tail behavior of distributions when analyzing digital health data, especially in clinical contexts.
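To make two of the representations in the abstract above concrete, the sketch below computes an empirical survival function and a binned (life-table style) hazard from one individual's minute-level counts. The binning, the function names, and the toy data are assumptions for illustration; the paper's own estimators may differ.

```python
def empirical_survival(counts, edges):
    """Share of minutes with activity count at or above each bin edge."""
    n = len(counts)
    return [sum(1 for c in counts if c >= e) / n for e in edges]


def discrete_hazard(counts, edges):
    """Per-bin hazard: minutes falling in the bin divided by minutes still
    'at risk', i.e. with a count at or above the bin's lower edge."""
    hazards = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        at_risk = sum(1 for c in counts if c >= lo)
        in_bin = sum(1 for c in counts if lo <= c < hi)
        hazards.append(in_bin / at_risk if at_risk else float("nan"))
    return hazards


# Illustrative data only: eight minutes of activity counts for one person.
minute_counts = [0, 0, 1, 2, 2, 3, 5, 8]
bin_edges = [0, 2, 4, 10]
surv = empirical_survival(minute_counts, bin_edges)
haz = discrete_hazard(minute_counts, bin_edges)
```

Because the hazard conditions on the minutes that remain above each threshold, it stays informative in the upper tail where the density is close to zero.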
In this paper, we present a complete framework for quickly calibrating and administering a robust large-scale computerized adaptive test (CAT) with a small number of responses. Calibration (learning item parameters in a test) is done using AutoIRT, a new method that uses automated machine learning (AutoML) in combination with item response theory (IRT), originally proposed in [Sharpnack et al., 2024]. AutoIRT trains a non-parametric AutoML grading model using item features, followed by an item-specific parametric model, which results in an explanatory IRT model. In our work, we use tabular AutoML tools (AutoGluon.tabular, [Erickson et al., 2020]) along with BERT embeddings and linguistically motivated NLP features. In this framework, we use Bayesian updating to obtain test taker ability posterior distributions for administration and scoring. For administration of our adaptive test, we propose the BanditCAT framework, a methodology motivated by casting the problem in the contextual bandit framework and utilizing IRT. The key insight lies in defining the bandit reward as the Fisher information for the selected item, given the latent test taker ability from IRT assumptions. We use Thompson sampling to balance between exploring items with different psychometric characteristics and selecting highly discriminative items that give more precise information about ability. To control item exposure, we inject noise through an additional randomization step before computing the Fisher information. This framework was used to initially launch two new item types on the DET practice test using limited training data. We outline some reliability and exposure metrics for the 5 practice test experiments that utilized this framework.
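The item-selection idea in the abstract above (Fisher information as the reward, Thompson sampling over ability) can be sketched as follows. This is not the BanditCAT implementation: it assumes a two-parameter logistic (2PL) IRT model, a normal approximation to the ability posterior, and hypothetical names throughout, and it omits the exposure-control randomization.

```python
import math
import random


def p_correct(theta, a, b):
    """2PL probability of a correct response (a: discrimination, b: difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_item(items, post_mean, post_sd, rng):
    """Thompson step: draw one ability from the (assumed normal) posterior,
    then pick the item with the largest Fisher information at that draw."""
    theta_draw = rng.gauss(post_mean, post_sd)
    return max(items, key=lambda name: fisher_information(theta_draw, *items[name]))


# Illustrative item bank only: name -> (discrimination, difficulty).
bank = {"easy": (1.0, -2.0), "matched": (1.0, 0.0), "hard": (1.0, 2.0)}
chosen = select_item(bank, post_mean=0.0, post_sd=1.0, rng=random.Random(0))
```

Early in a test the posterior is wide, so the draws vary and different items get selected; as responses accumulate and the posterior narrows, selection concentrates on the items most informative near the estimated ability.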
We introduce the almost goodness-of-fit test, a procedure to decide if a (parametric) model provides a good representation of the probability distribution generating the observed sample. We consider the approximate model determined by an M-estimator of the parameters as the best representative of the unknown distribution within the parametric class. The objective is the approximate validation of a distribution or an entire parametric family up to a pre-specified threshold value, the margin of error. The methodology also allows quantifying the percentage improvement of the proposed model compared to a non-informative (constant) one. The test statistic is the $\mathrm{L}^p$-distance between the empirical distribution function and the corresponding one of the estimated (parametric) model. The value of the parameter $p$ allows modulating the impact of the tails of the distribution in the validation of the model. By deriving the asymptotic distribution of the test statistic, as well as proving the consistency of its bootstrap approximation, we present an easy-to-implement and flexible method. The performance of the proposal is illustrated with a simulation study and the analysis of a real dataset.
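As a rough illustration of the statistic described in the abstract above, the sketch below approximates an L^p distance between the empirical distribution function and a fitted model CDF with a midpoint Riemann sum. Integrating with respect to Lebesgue measure over a finite interval, the exponential model, and all function names are assumptions made here; the paper's exact definition, normalization, and bootstrap calibration are not reproduced.

```python
import math
from bisect import bisect_right


def lp_distance(sample, model_cdf, p, lo, hi, steps=10000):
    """Approximate (integral over [lo, hi] of |F_n(x) - F(x)|^p dx)^(1/p)."""
    xs = sorted(sample)
    n = len(xs)
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h           # midpoint of the k-th subinterval
        edf = bisect_right(xs, x) / n    # empirical distribution function at x
        total += abs(edf - model_cdf(x)) ** p * h
    return total ** (1.0 / p)


def exponential_cdf(rate):
    """CDF of an exponential distribution with the given rate."""
    return lambda x: 1.0 - math.exp(-rate * x) if x > 0 else 0.0


# Illustrative data only: 200 evenly spaced quantiles of a unit exponential.
data = [-math.log(1.0 - (i + 0.5) / 200) for i in range(200)]
rate_hat = len(data) / sum(data)  # maximum likelihood estimate, one M-estimator
distance = lp_distance(data, exponential_cdf(rate_hat), p=2.0, lo=0.0, hi=20.0)
```

Comparing `distance` against a pre-specified margin of error mirrors the validation idea in the abstract, while larger values of `p` put more weight on the points where the two distribution functions disagree most.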
In the literature on stochastic frontier models until the early 2000s, the joint consideration of spatial and temporal dimensions was often inadequately addressed, if not completely neglected. However, from an evolutionary economics perspective, the production process of the decision-making units constantly changes over both dimensions: it is not stable over time due to managerial enhancements and/or internal or external shocks, and is influenced by the nearest territorial neighbours. This paper proposes an extension of the Fusco and Vidoli [2013] SEM-like approach, which globally accounts for spatial and temporal effects in the inefficiency term. In particular, consistent with the stochastic panel frontier literature, two different versions of the model are proposed: the time-invariant and the time-varying spatial stochastic frontier models. In order to evaluate the inferential properties of the proposed estimators, we first run Monte Carlo experiments and we then present the results of an application to a set of commonly referenced data, demonstrating robustness and stability of estimates across all scenarios.
The analysis of neural power spectra plays a crucial role in understanding brain function and dysfunction. While recent efforts have led to the development of methods for decomposing spectral data, challenges remain in performing statistical analysis and group-level comparisons. Here, we introduce Bayesian Spectral Decomposition (BSD), a Bayesian framework for analysing neural spectral power. BSD allows for the specification, inversion, comparison, and analysis of parametric models of neural spectra, addressing limitations of existing methods. We first establish the face validity of BSD on simulated data and show how it outperforms an established method (FOOOF) for peak detection on artificial spectral data. We then demonstrate the efficacy of BSD on a group-level study of EEG spectra in 204 healthy subjects from the LEMON dataset. Our results not only highlight the effectiveness of BSD in model selection and parameter estimation, but also illustrate how BSD enables straightforward group-level regression of the effect of continuous covariates such as age. By using Bayesian inference techniques, BSD provides a robust framework for studying neural spectral data and their relationship to brain function and dysfunction.
Quantifying uncertainty in networks is an important step in modelling relationships and interactions between entities. We consider the challenge of bootstrapping an inhomogeneous random graph when only a single observation of the network is made and the underlying data generating function is unknown. We utilise an exchangeable network test that can empirically validate bootstrap samples generated by any method, by testing if the observed and bootstrapped networks are statistically distinguishable. We find that existing methods fail this test. To address this, we propose a principled, novel, distribution-free network bootstrap using k-nearest neighbour smoothing, that can regularly pass this exchangeable network test in both synthetic and real-data scenarios. We demonstrate the utility of this work in combination with the popular data visualisation method t-SNE, where uncertainty estimates from bootstrapping are used to explain whether visible structures represent real statistically sound structures.
In astronomical observations, the estimation of distances from parallaxes is a challenging task due to the inherent measurement errors and the non-linear relationship between the parallax and the distance. This study leverages ideas from robust Bayesian inference to tackle these challenges, investigating a broad class of prior densities for estimating distances with a reduced bias and variance. Through theoretical analysis, simulation experiments, and the application to data from the Gaia Data Release 1 (GDR1), we demonstrate that heavy-tailed priors provide more reliable distance estimates, particularly in the presence of large fractional parallax errors. Theoretical results highlight the "curse of a single observation," where the likelihood dominates the posterior, limiting the impact of the prior. Nevertheless, heavy-tailed priors can delay the explosion of posterior risk, offering a more robust framework for distance estimation. The findings suggest that reciprocal invariant priors, with polynomial decay in their tails, such as the Half-Cauchy and Product Half-Cauchy, are particularly well-suited for this task, providing a balance between bias reduction and variance control.
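The estimation problem in the abstract above can be illustrated with a single-observation posterior evaluated on a grid. The sketch assumes a Gaussian measurement model for the parallax around 1/r, places a half-Cauchy prior directly on the distance r, and uses hypothetical function names, scale, and grid settings; it is not the class of priors or the estimators analysed in the paper.

```python
import math


def log_posterior(r, parallax, sigma, scale):
    """Unnormalized log posterior of distance r (kpc) for a parallax in mas."""
    if r <= 0:
        return float("-inf")
    # Gaussian likelihood: observed parallax ~ Normal(1 / r, sigma^2).
    log_lik = -0.5 * ((parallax - 1.0 / r) / sigma) ** 2
    # Half-Cauchy prior on r, up to an additive constant (polynomial tail).
    log_prior = -math.log(1.0 + (r / scale) ** 2)
    return log_lik + log_prior


def posterior_mode(parallax, sigma, scale=1.0, r_max=10.0, steps=100000):
    """Grid search for the posterior mode on (0, r_max]."""
    grid = (r_max * (k + 1) / steps for k in range(steps))
    return max(grid, key=lambda r: log_posterior(r, parallax, sigma, scale))


# Illustrative values only.
precise = posterior_mode(parallax=1.0, sigma=0.01)   # small fractional error
noisy = posterior_mode(parallax=-0.5, sigma=1.0)     # negative observed parallax
```

With a precise parallax the mode sits close to the naive estimate 1/parallax, whereas for a noisy or negative parallax the naive inverse is unusable and the prior keeps the estimate finite and positive.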
In the context of paid research studies and clinical trials, budget considerations and patient sampling from available populations are subject to inherent constraints. We introduce the R package CDsampling, which is, to our knowledge, the first to integrate optimal design theories within the framework of constrained sampling. This package offers the possibility to find both D-optimal approximate and exact allocations for samplings with or without constraints. Additionally, it provides functions to find constrained uniform sampling as a robust sampling strategy when the model information is limited. To demonstrate its efficacy, we provide simulated examples and a real-data example with datasets embedded in the package and compare them with classical sampling methods. Furthermore, the package revisits the theoretical results of the Fisher information matrix for generalized linear models (including the regular linear regression model) and multinomial logistic models, offering functions for its computation.
To unlock access to stronger winds, the offshore wind industry is advancing with significantly larger and taller wind turbines. This massive upscaling motivates a departure from univariate wind forecasting methods that traditionally focused on a single representative height. To fill this gap, we propose DeepMIDE, a statistical deep learning method that jointly models the offshore wind speeds across space, time, and height. DeepMIDE is formulated as a multi-output integro-difference equation model with a multivariate, nonstationary, and state-dependent kernel characterized by a set of advection vectors that encode the physics of wind field formation and propagation. Embedded within DeepMIDE, an advanced deep learning architecture learns these advection vectors from high dimensional streams of exogenous weather information, which, along with other parameters, are plugged back into the statistical model for probabilistic multi-height space-time forecasting. Tested on real-world data from future offshore wind energy sites in the Northeastern United States, the wind speed and power forecasts from DeepMIDE are shown to outperform those from prevalent time series, spatio-temporal, and deep learning methods.
Industrial applications often exhibit multiple in-control patterns due to varying operating conditions, which makes a single functional linear model (FLM) inadequate to capture the complexity of the true relationship between a functional quality characteristic and covariates. This gives rise to the multimode profile monitoring problem. This issue is clearly illustrated in the resistance spot welding (RSW) process in the automotive industry, where different operating conditions lead to multiple in-control states. In these states, factors such as electrode tip wear and dressing may influence the functional quality characteristic differently, resulting in distinct FLMs across subpopulations. To address this problem, this article introduces the functional mixture regression control chart (FMRCC) to monitor functional quality characteristics with multiple in-control patterns and covariate information, modeled using a mixture of FLMs. A monitoring strategy based on the likelihood ratio test is proposed to monitor any deviation from the estimated in-control heterogeneous population. An extensive Monte Carlo simulation study is performed to compare the FMRCC with competing monitoring schemes that have already appeared in the literature, and a case study in the monitoring of an RSW process in the automotive industry, which motivated this research, illustrates its practical applicability.
Causal discovery is a fundamental problem with applications spanning various areas in science and engineering. It is well understood that solely using observational data, one can only orient the causal graph up to its Markov equivalence class, necessitating interventional data to learn the complete causal graph. Most works in the literature design causal discovery policies with perfect interventions, i.e., they have access to infinite interventional samples. This study considers a Bayesian approach for learning causal graphs with limited interventional samples, mirroring real-world scenarios where such samples are usually costly to obtain. By leveraging the recent result of Wienöbst et al. (2023) on uniform DAG sampling in polynomial time, we can efficiently enumerate all the cut configurations and their corresponding interventional distributions of a target set, and further track their posteriors. Given any number of interventional samples, our proposed algorithm randomly intervenes on a set of target vertices that cut all the edges in the graph and returns a causal graph according to the posterior of each target set. When the number of interventional samples is large enough, we show theoretically that our proposed algorithm will return the true causal graph with high probability. We compare our algorithm against various baseline methods on simulated datasets, demonstrating its superior accuracy measured by the structural Hamming distance between the learned DAG and the ground truth. Additionally, we present a case study showing how this algorithm could be modified to answer more general causal questions without learning the whole graph. As an example, we illustrate that our method can be used to estimate the causal effect of a variable that cannot be intervened on.
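The accuracy metric mentioned in the abstract above, the structural Hamming distance, can be sketched for two DAGs given as sets of directed edges. Conventions differ across papers; this version assumes a missing, extra, or reversed edge each counts as one, and the function name is hypothetical.

```python
def structural_hamming_distance(true_edges, learned_edges):
    """Count node pairs whose edge status (absent, u->v, or v->u) differs."""
    t, l = set(true_edges), set(learned_edges)
    # Every unordered node pair that carries an edge in at least one graph.
    pairs = {tuple(sorted(edge)) for edge in t | l}
    return sum(
        ((u, v) in t, (v, u) in t) != ((u, v) in l, (v, u) in l)
        for u, v in pairs
    )


# Illustrative graphs only: one reversed, one missing, and one extra edge.
truth = {("A", "B"), ("B", "C"), ("A", "C")}
learned = {("B", "A"), ("B", "C"), ("C", "D")}
shd = structural_hamming_distance(truth, learned)
```

A distance of zero means the learned DAG matches the ground truth edge for edge, so lower values indicate better recovery.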
The vigor of potato plants, defined as the canopy area at the end of the exponential growth stage, depends on the origin and physiological state of the seed tuber. Experiments carried out with six potato varieties in three test fields over three years show that there is a 73%-90% correlation in the vigor of the plants from the same seedlot grown in different test fields. However, these correlations are not always observed on the level of individual varieties and vanish or become negative when the seed tubers and young plants experience environmental stress. A comprehensive study of the association between the vigor and the seed tuber biochemistry has revealed that, while 50%-70% of the variation in the plant vigor is explained by the tuber data, the vigor is dominated by the potato genotype. Analysis of individual predictors, such as the abundance of a particular metabolite, indicates that the vigor enhancing properties of the seed tubers differ between genotypes. Variety-specific models show that, for some varieties, up to 30% of the vigor variation within the variety is explained by and can be predicted from the tuber biochemistry, whereas, for other varieties, the association between the tuber composition and the vigor is much weaker.
Power outages have become increasingly frequent, intense, and prolonged in the US due to climate change, aging electrical grids, and rising energy demand. However, largely due to the absence of granular spatiotemporal outage data, we lack data-driven evidence and analytics-based metrics to quantify power system vulnerability. This limitation has hindered the ability to effectively evaluate and address vulnerability to power outages in US communities. Here, we collected ~179 million power outage records at 15-minute intervals across 3022 counties in the contiguous US (96.15% of the area) from 2014 to 2023. We developed a power system vulnerability assessment framework based on three dimensions (intensity, frequency, and duration) and applied interpretable machine learning models (XGBoost and SHAP) to compute a Power System Vulnerability Index (PSVI) at the county level. Our analysis reveals a consistent increase in power system vulnerability over the past decade. We identified 318 counties across 45 states as hotspots for high power system vulnerability, particularly on the West Coast (California and Washington), on the East Coast (Florida and the Northeast area), in the Great Lakes megalopolis (Chicago-Detroit metropolitan areas), and along the Gulf of Mexico (Texas). Heterogeneity analysis indicates that urban counties, counties with interconnected grids, and states with high solar generation exhibit significantly higher vulnerability. Our results highlight the significance of the proposed PSVI for evaluating the vulnerability of communities to power outages. The findings underscore the widespread and pervasive impact of power outages across the country and offer crucial insights to support infrastructure operators, policymakers, and emergency managers in formulating policies and programs aimed at enhancing the resilience of the US power infrastructure.