2024-10-29 | | Total: 92
We propose a general transfer learning framework for clustering given a main dataset and an auxiliary one about the same subjects. The two datasets may reflect similar but different latent grouping structures of the subjects. We propose an adaptive transfer clustering (ATC) algorithm that automatically leverages the commonality in the presence of unknown discrepancy, by optimizing an estimated bias-variance decomposition. It applies to a broad class of statistical models including Gaussian mixture models, stochastic block models, and latent class models. A theoretical analysis proves the optimality of ATC under the Gaussian mixture model and explicitly quantifies the benefit of transfer. Extensive simulations and real data experiments confirm our method's effectiveness in various scenarios.
Multivariate spatial disease mapping has become a pivotal part of everyday practice in social epidemiology. Despite the existence of several specifications for the relation between different outcomes, there is still a need for a new strategy that focuses on comparing the spatial risk patterns of different subgroups of the population. This paper introduces a new approach for detecting differences in spatial risk patterns between different populations at risk, using suicide-related emergency calls to study suicide risks in the Valencian Community (Spain).
Marine Protected Areas (MPAs) have been established globally to conserve marine resources. Given their maintenance costs and impact on commercial fishing, it is critical to evaluate their effectiveness to support future conservation. In this paper, we use data collected from the Australian coast to estimate the effect of MPAs on biodiversity. Environmental studies such as these are often observational, and processes of interest exhibit spatial dependence, which presents challenges in estimating the causal effects. Spatial data can also be subject to preferential sampling, where the sampling locations are related to the response variable, further complicating inference and prediction. To address these challenges, we propose a spatial causal inference method that simultaneously accounts for unmeasured spatial confounders in both the sampling process and the treatment allocation. We prove the identifiability of key parameters in the model and the consistency of the posterior distributions of those parameters. We show via simulation studies that the causal effect of interest can be reliably estimated under the proposed model. The proposed method is applied to assess the effect of MPAs on fish biomass. We find evidence of preferential sampling and that properly accounting for this source of bias impacts the estimate of the causal effect.
Covariance matrix estimation is an important problem in multivariate data analysis, both from theoretical as well as applied points of view. Many simple and popular covariance matrix estimators are known to be severely affected by model misspecification and the presence of outliers in the data; on the other hand robust estimators with reasonably high efficiency are often computationally challenging for modern large and complex datasets. In this work, we propose a new, simple, robust and highly efficient method for estimation of the location vector and the scatter matrix for elliptically symmetric distributions. The proposed estimation procedure is designed in the spirit of the minimum density power divergence (DPD) estimation approach with appropriate modifications which makes our proposal (sequential minimum DPD estimation) computationally very economical and scalable to large as well as higher dimensional datasets. Consistency and asymptotic normality of the proposed sequential estimators of the multivariate location and scatter are established along with asymptotic positive definiteness of the estimated scatter matrix. Robustness of our estimators are studied by means of influence functions. All theoretical results are illustrated further under multivariate normality. A large-scale simulation study is presented to assess finite sample performances and scalability of our method in comparison to the usual maximum likelihood estimator (MLE), the ordinary minimum DPD estimator (MDPDE) and other popular non-parametric methods. The applicability of our method is further illustrated with a real dataset on credit card transactions.
Cascade models serve as effective tools for understanding the propagation of information and diseases within social networks. Nevertheless, their applicability becomes constrained when the states of the agents (nodes) are hidden and can only be inferred through indirect observations or symptoms. This study proposes a Mapper-based strategy to infer the status of agents within a hidden information cascade model using expert knowledge. To verify and demonstrate the method we identify agents who are likely to take advantage of information obtained from an inside information network. We do this using data on insider networks and stock market transactions. Recognizing the sensitive nature of allegations of insider trading, we design a conservative approach to minimize false positives, ensuring that innocent agents are not wrongfully implicated. The Mapper-based results systematically outperform other methods, such as clustering and unsupervised anomaly detection, on synthetic data. We also apply the method to empirical data and verify the results using a statistical validation method based on persistence homology. Our findings highlight that the proposed Mapper-based technique successfully identifies a subpopulation of opportunistic agents within the information cascades. The adaptability of this method to diverse data types and sizes is demonstrated, with potential for tailoring for specific applications.
When comparing multiple groups in clinical trials, we are not only interested in whether there is a difference between any groups but rather the location. Such research questions lead to testing multiple individual hypotheses. To control the familywise error rate (FWER), we must apply some corrections or introduce tests that control the FWER by design. In the case of time-to-event data, a Bonferroni-corrected log-rank test is commonly used. This approach has two significant drawbacks: (i) it loses power when the proportional hazards assumption is violated [1] and (ii) the correction generally leads to a lower power, especially when the test statistics are not independent [2]. We propose two new tests based on combined weighted log-rank tests. One as a simple multiple contrast test of weighted log-rank tests and one as an extension of the so-called CASANOVA test [3]. The latter was introduced for factorial designs. We propose a new multiple contrast test based on the CASANOVA approach. Our test promises to be more powerful under crossing hazards and eliminates the need for additional p-value correction. We assess the performance of our tests through extensive Monte Carlo simulation studies covering both proportional and non-proportional hazard scenarios. Finally, we apply the new and reference methods to a real-world data example. The new approaches control the FWER and show reasonable power in all scenarios. They outperform the adjusted approaches in some non-proportional settings in terms of power.
Many practical applications of online reinforcement learning require the satisfaction of safety constraints while learning about the unknown environment. In this work, we study Linear Quadratic Regulator (LQR) learning with unknown dynamics, but with the additional constraint that the position must stay within a safe region for the entire trajectory with high probability. Unlike in previous works, we allow for both bounded and unbounded noise distributions and study stronger baselines of nonlinear controllers that are better suited for constrained problems than linear controllers. Due to these complications, we focus on 1-dimensional state- and action- spaces, however we also discuss how we expect the high-level takeaways can generalize to higher dimensions. Our primary contribution is the first $\tilde{O}_T(\sqrt{T})$-regret bound for constrained LQR learning, which we show relative to a specific baseline of non-linear controllers. We then prove that, for any non-linear baseline satisfying natural assumptions, $\tilde{O}_T(\sqrt{T})$-regret is possible when the noise distribution has sufficiently large support and $\tilde{O}_T(T^{2/3})$-regret is possible for any subgaussian noise distribution. An overarching theme of our results is that enforcing safety provides "free exploration" that compensates for the added cost of uncertainty in safety constrained control, resulting in the same regret rate as in the unconstrained problem.
In this paper, we present a complete framework for quickly calibrating and administering a robust large-scale computerized adaptive test (CAT) with a small number of responses. Calibration - learning item parameters in a test - is done using AutoIRT, a new method that uses automated machine learning (AutoML) in combination with item response theory (IRT), originally proposed in [Sharpnack et al., 2024]. AutoIRT trains a non-parametric AutoML grading model using item features, followed by an item-specific parametric model, which results in an explanatory IRT model. In our work, we use tabular AutoML tools (AutoGluon.tabular, [Erickson et al., 2020]) along with BERT embeddings and linguistically motivated NLP features. In this framework, we use Bayesian updating to obtain test taker ability posterior distributions for administration and scoring. For administration of our adaptive test, we propose the BanditCAT framework, a methodology motivated by casting the problem in the contextual bandit framework and utilizing item response theory (IRT). The key insight lies in defining the bandit reward as the Fisher information for the selected item, given the latent test taker ability from IRT assumptions. We use Thompson sampling to balance between exploring items with different psychometric characteristics and selecting highly discriminative items that give more precise information about ability. To control item exposure, we inject noise through an additional randomization step before computing the Fisher information. This framework was used to initially launch two new item types on the DET practice test using limited training data. We outline some reliability and exposure metrics for the 5 practice test experiments that utilized this framework.
Bayesian inference for doubly intractable distributions is challenging because they include intractable terms, which are functions of parameters of interest. Although several alternatives have been developed for such models, they are computationally intensive due to repeated auxiliary variable simulations. We propose a novel Monte Carlo Stein variational gradient descent (MC-SVGD) approach for inference for doubly intractable distributions. Through an efficient gradient approximation, our MC-SVGD approach rapidly transforms an arbitrary reference distribution to approximate the posterior distribution of interest, without necessitating any predefined variational distribution class for the posterior. Such a transport map is obtained by minimizing Kullback-Leibler divergence between the transformed and posterior distributions in a reproducing kernel Hilbert space (RKHS). We also investigate the convergence rate of the proposed method. We illustrate the application of the method to challenging examples, including a Potts model, an exponential random graph model, and a Conway--Maxwell--Poisson regression model. The proposed method achieves substantial computational gains over existing algorithms, while providing comparable inferential performance for the posterior distributions.
We introduce the almost goodness-of-fit test, a procedure to decide if a (parametric) model provides a good representation of the probability distribution generating the observed sample. We consider the approximate model determined by an M-estimator of the parameters as the best representative of the unknown distribution within the parametric class. The objective is the approximate validation of a distribution or an entire parametric family up to a pre-specified threshold value, the margin of error. The methodology also allows quantifying the percentage improvement of the proposed model compared to a non-informative (constant) one. The test statistic is the $\mathrm{L}^p$-distance between the empirical distribution function and the corresponding one of the estimated (parametric) model. The value of the parameter $p$ allows modulating the impact of the tails of the distribution in the validation of the model. By deriving the asymptotic distribution of the test statistic, as well as proving the consistency of its bootstrap approximation, we present an easy-to-implement and flexible method. The performance of the proposal is illustrated with a simulation study and the analysis of a real dataset.
In the literature on stochastic frontier models until the early 2000s, the joint consideration of spatial and temporal dimensions was often inadequately addressed, if not completely neglected. However, from an evolutionary economics perspective, the production process of the decision-making units constantly changes over both dimensions: it is not stable over time due to managerial enhancements and/or internal or external shocks, and is influenced by the nearest territorial neighbours. This paper proposes an extension of the Fusco and Vidoli [2013] SEM-like approach, which globally accounts for spatial and temporal effects in the term of inefficiency. In particular, coherently with the stochastic panel frontier literature, two different versions of the model are proposed: the time-invariant and the time-varying spatial stochastic frontier models. In order to evaluate the inferential properties of the proposed estimators, we first run Monte Carlo experiments and we then present the results of an application to a set of commonly referenced data, demonstrating robustness and stability of estimates across all scenarios.
The standard paradigm for confirmatory clinical trials is to compare experimental treatments with a control, for example the standard of care or a placebo. However, it is not always the case that a suitable control exists. Efficient statistical methodology is well studied in the setting of randomised controlled trials. This is not the case if one wishes to compare several experimental with no control arm. We propose hypothesis testing methods suitable for use in such a setting. These methods are efficient, ensuring the error rate is controlled at exactly the desired rate with no conservatism. This in turn yields an improvement in power when compared with standard methods one might otherwise consider using, such as a Bonferroni adjustment. The proposed testing procedure is also highly flexible. We show how it may be extended for use in multi-stage adaptive trials, covering the majority of scenarios in which one might consider the use of such procedures in the clinical trials setting. With such a highly flexible nature, these methods may also be applied more broadly outside of a clinical trials setting.
Quantifying uncertainty in networks is an important step in modelling relationships and interactions between entities. We consider the challenge of bootstrapping an inhomogeneous random graph when only a single observation of the network is made and the underlying data generating function is unknown. We utilise an exchangeable network test that can empirically validate bootstrap samples generated by any method, by testing if the observed and bootstrapped networks are statistically distinguishable. We find that existing methods fail this test. To address this, we propose a principled, novel, distribution-free network bootstrap using k-nearest neighbour smoothing, that can regularly pass this exchangeable network test in both synthetic and real-data scenarios. We demonstrate the utility of this work in combination with the popular data visualisation method t-SNE, where uncertainty estimates from bootstrapping are used to explain whether visible structures represent real statistically sound structures.
This paper studies stable learning methods for generative models that enable high-quality data generation. Noise injection is commonly used to stabilize learning. However, selecting a suitable noise distribution is challenging. Diffusion-GAN, a recently developed method, addresses this by using the diffusion process with a timestep-dependent discriminator. We investigate Diffusion-GAN and reveal that data scaling is a key component for stable learning and high-quality data generation. Building on our findings, we propose a learning algorithm, Scale-GAN, that uses data scaling and variance-based regularization. Furthermore, we theoretically prove that data scaling controls the bias-variance trade-off of the estimation error bound. As a theoretical extension, we consider GAN with invertible data augmentations. Comparative evaluations on benchmark datasets demonstrate the effectiveness of our method in improving stability and accuracy.
In statistical inference, we commonly assume that samples are independent and identically distributed from a probability distribution included in a pre-specified statistical model. However, such an assumption is often violated in practice. Even an unexpected extreme sample called an {\it outlier} can significantly impact classical estimators. Robust statistics studies how to construct reliable statistical methods that efficiently work even when the ideal assumption is violated. Recently, some works revealed that robust estimators such as Tukey's median are well approximated by the generative adversarial net (GAN), a popular learning method for complex generative models using neural networks. GAN is regarded as a learning method using integral probability metrics (IPM), which is a discrepancy measure for probability distributions. In most theoretical analyses of Tukey's median and its GAN-based approximation, however, the Gaussian or elliptical distribution is assumed as the statistical model. In this paper, we explore the application of GAN-like estimators to a general class of statistical models. As the statistical model, we consider the kernel exponential family that includes both finite and infinite-dimensional models. To construct a robust estimator, we propose the smoothed total variation (STV) distance as a class of IPMs. Then, we theoretically investigate the robustness properties of the STV-based estimators. Our analysis reveals that the STV-based estimator is robust against the distribution contamination for the kernel exponential family. Furthermore, we analyze the prediction accuracy of a Monte Carlo approximation method, which circumvents the computational difficulty of the normalization constant.
Non-Gaussian likelihoods are essential for modelling complex real-world observations but pose significant computational challenges in learning and inference. Even with Gaussian priors, non-Gaussian likelihoods often lead to analytically intractable posteriors, necessitating approximation methods. To this end, we propose efficient schemes to approximate the effects of non-Gaussian likelihoods by Gaussian densities based on variational inference and moment matching in transformed bases. These enable efficient inference strategies originally designed for models with a Gaussian likelihood to be deployed. Our empirical results demonstrate that the proposed matching strategies attain good approximation quality for binary and multiclass classification in large-scale point-estimate and distributional inferential settings. In challenging streaming problems, the proposed methods outperform all existing likelihood approximations and approximate inference methods in the exact models. As a by-product, we show that the proposed approximate log-likelihoods are a superior alternative to least-squares on raw labels for neural network classification.
In this paper, an analysis of hourly air temperatures in four groups of 32 stations of the UK highland (five stations), UK lowland (four stations), Italian highland (eleven stations), and Italian lowland (twelve stations) at various altitudes was conducted over the period from 2002 to 2021. The study aimed to examine the trends of each hour of the day in that period, over different averaging time windows (-10 day, -30 day, and -60 day). The trends were computed using the Mann-Kendall trend test and Sen's slope estimator. The similarity of trends within and across the groups of stations was assessed using the hierarchical clustering with dynamic time warping technique. An additional analysis was conducted to show the correlation of trends among the group of stations using the correlation distance matrix. Hierarchical clustering and distance correlation analysis show trend similarities and correlations, also indicating dissimilarities among different groups. Using 30 day averages, significant warming trends in specific months at the Italian stations are evident, especially in February, July, August, and December. The UK highland stations did not show statistically significant trends, but clear pattern similarities were found within the groups, especially in certain months. The ultimate goal of this paper is to provide insights into temperature dynamics and climate change characteristics on regional and diurnal scales.
Large-scale datasets with count outcome variables are widely present in various applications, and the Poisson regression model is among the most popular models for handling count outcomes. This paper considers the high-dimensional sparse Poisson regression model and proposes bias-corrected estimators for both linear and quadratic transformations of high-dimensional regression vectors. We establish the asymptotic normality of the estimators, construct asymptotically valid confidence intervals, and conduct related hypothesis testing. We apply the devised methodology to high-dimensional mediation analysis with count outcome, with particular application of testing for the existence of interaction between the treatment variable and high-dimensional mediators. We demonstrate the proposed methods through extensive simulation studies and application to real-world epigenetic data.
Federated Learning (FL) has emerged as a groundbreaking paradigm in collaborative machine learning, emphasizing decentralized model training to address data privacy concerns. While significant progress has been made in optimizing federated learning, the exploration of generalization error, particularly in heterogeneous settings, has been limited, focusing mainly on parametric cases. This paper investigates the generalization properties of deep federated regression within a two-stage sampling model. Our findings highlight that the intrinsic dimension, defined by the entropic dimension, is crucial for determining convergence rates when appropriate network sizes are used. Specifically, if the true relationship between response and explanatory variables is charecterized by a $\beta$-Hölder function and there are $n$ independent and identically distributed (i.i.d.) samples from $m$ participating clients, the error rate for participating clients scales at most as $\tilde{O}\left((mn)^{-2\beta/(2\beta + \bar{d}_{2\beta}(\lambda))}\right)$, and for non-participating clients, it scales as $\tilde{O}\left(\Delta \cdot m^{-2\beta/(2\beta + \bar{d}_{2\beta}(\lambda))} + (mn)^{-2\beta/(2\beta + \bar{d}_{2\beta}(\lambda))}\right)$. Here, $\bar{d}_{2\beta}(\lambda)$ represents the $2\beta$-entropic dimension of $\lambda$, the marginal distribution of the explanatory variables, and $\Delta$ characterizes the dependence between the sampling stages. Our results explicitly account for the "closeness" of clients, demonstrating that the convergence rates of deep federated learners depend on intrinsic rather than nominal high-dimensionality.
We consider the injectivity property of the ReLU networks layers. Determining the ReLU injectivity capacity (ratio of the number of layer's inputs and outputs) is established as isomorphic to determining the capacity of the so-called $\ell_0$ spherical perceptron. Employing \emph{fully lifted random duality theory} (fl RDT) a powerful program is developed and utilized to handle the $\ell_0$ spherical perceptron and implicitly the ReLU layers injectivity. To put the entire fl RDT machinery in practical use, a sizeable set of numerical evaluations is conducted as well. The lifting mechanism is observed to converge remarkably fast with relative corrections in the estimated quantities not exceeding $\sim 0.1\%$ already on the third level of lifting. Closed form explicit analytical relations among key lifting parameters are uncovered as well. In addition to being of incredible importance in handling all the required numerical work, these relations also shed a new light on beautiful parametric interconnections within the lifting structure. Finally, the obtained results are also shown to fairly closely match the replica predictions from [40].
In astronomical observations, the estimation of distances from parallaxes is a challenging task due to the inherent measurement errors and the non-linear relationship between the parallax and the distance. This study leverages ideas from robust Bayesian inference to tackle these challenges, investigating a broad class of prior densities for estimating distances with a reduced bias and variance. Through theoretical analysis, simulation experiments, and the application to data from the Gaia Data Release 1 (GDR1), we demonstrate that heavy-tailed priors provide more reliable distance estimates, particularly in the presence of large fractional parallax errors. Theoretical results highlight the "curse of a single observation," where the likelihood dominates the posterior, limiting the impact of the prior. Nevertheless, heavy-tailed priors can delay the explosion of posterior risk, offering a more robust framework for distance estimation. The findings suggest that reciprocal invariant priors, with polynomial decay in their tails, such as the Half-Cauchy and Product Half-Cauchy, are particularly well-suited for this task, providing a balance between bias reduction and variance control.
Bandit algorithms have garnered significant attention due to their practical applications in real-world scenarios. However, beyond simple settings such as multi-arm or linear bandits, optimal algorithms remain scarce. Notably, no optimal solution exists for pure exploration problems in the context of generalized linear model (GLM) bandits. In this paper, we narrow this gap and develop the first track-and-stop algorithm for general pure exploration problems under the logistic bandit called logistic track-and-stop (Log-TS). Log-TS is an efficient algorithm that asymptotically matches an approximation for the instance-specific lower bound of the expected sample complexity up to a logarithmic factor.
The purpose of this paper is to answer a few open questions in the interface of kernel methods and PDE gradient flows. Motivated by recent advances in machine learning, particularly in generative modeling and sampling, we present a rigorous investigation of Fisher-Rao and Wasserstein type gradient flows concerning their gradient structures, flow equations, and their kernel approximations. Specifically, we focus on the Fisher-Rao (also known as Hellinger) geometry and its various kernel-based approximations, developing a principled theoretical framework using tools from PDE gradient flows and optimal transport theory. We also provide a complete characterization of gradient flows in the maximum-mean discrepancy (MMD) space, with connections to existing learning and inference algorithms. Our analysis reveals precise theoretical insights linking Fisher-Rao flows, Stein flows, kernel discrepancies, and nonparametric regression. We then rigorously prove evolutionary $\Gamma$-convergence for kernel-approximated Fisher-Rao flows, providing theoretical guarantees beyond pointwise convergence. Finally, we analyze energy dissipation using the Helmholtz-Rayleigh principle, establishing important connections between classical theory in mechanics and modern machine learning practice. Our results provide a unified theoretical foundation for understanding and analyzing approximations of gradient flows in machine learning applications through a rigorous gradient flow and variational method perspective.
Research on modeling the distributional aspects in sensor-based digital health (sDHT) data has grown significantly in recent years. Most existing approaches focus on using individual-specific density or quantile functions. However, there has been limited exploration to assess the practical utility of alternative distributional representations in clinical contexts collecting sDHT data. This study is motivated by accelerometry data collected on 246 individuals with multiple sclerosis (MS) representing a wide range of disability (Expanded Disability Status Scale, EDSS: 0-7). We consider five different individual-level distributional representations of minute-level activity counts: density, survival, hazard, quantile, and total time on test functions. For each of the five distributional representations, scalar-on-function regression fits linear discriminators for binary and continuously measured MS disability, and cross-validated discriminatory performance of these linear discriminators is compared across. The results show that individual-level hazard functions provide the highest discriminatory accuracy, more than double the accuracy compared to density functions. Individual-level quantile functions provided the second-highest discriminatory accuracy. These findings highlight the importance of focusing on distributional representations that capture the tail behavior of distributions when analyzing digital health data, especially in clinical contexts.
In the context of paid research studies and clinical trials, budget considerations and patient sampling from available populations are subject to inherent constraints. We introduce the R package CDsampling, which is the first to our knowledge to integrate optimal design theories within the framework of constrained sampling. This package offers the possibility to find both D-optimal approximate and exact allocations for samplings with or without constraints. Additionally, it provides functions to find constrained uniform sampling as a robust sampling strategy when the model information is limited. To demonstrate its efficacy, we provide simulated examples and a real-data example with datasets embedded in the package and compare them with classical sampling methods. Furthermore, it revisits the theoretical results of the Fisher information matrix for generalized linear models (including regular linear regression model) and multinomial logistic models, offering functions for its computation.