https://papers.cool/arxiv/stat
Statistics
2024-06-21T00:00:00+00:00
python-feedgen
Cool Papers - Immersive Paper Discovery

https://papers.cool/arxiv/2406.12908
Rating Multi-Modal Time-Series Forecasting Models (MM-TSFM) for Robustness Through a Causal Lens
2024-06-21T00:00:00+00:00
Kausik Lakkaraju, Rachneet Kaur, Zhen Zeng, Parisa Zehtabi, Sunandita Patra, Biplav Srivastava, Marco Valtorta
AI systems are notorious for their fragility; minor input changes can potentially cause major output swings. When such systems are deployed in critical areas like finance, the consequences of their uncertain behavior could be severe. In this paper, we focus on multi-modal time-series forecasting, where imprecision due to noisy or incorrect data can lead to erroneous predictions, impacting stakeholders such as analysts, investors, and traders. Recently, it has been shown that beyond numeric data, graphical transformations can be used with advanced visual models to achieve better performance. In this context, we introduce a rating methodology to assess the robustness of Multi-Modal Time-Series Forecasting Models (MM-TSFM) through causal analysis, which helps us understand and quantify the isolated impact of various attributes on the forecasting accuracy of MM-TSFM. We apply our novel rating method on a variety of numeric and multi-modal forecasting models in a large experimental setup (six input settings of control and perturbations, ten data distributions, time series from six leading stocks in three industries over a year of data, and five time-series forecasters) to draw insights on robust forecasting models and the context of their strengths. Within the scope of our study, our main result is that multi-modal (numeric + visual) forecasting, which was found to be more accurate than numeric forecasting in previous studies, can also be more robust in diverse settings.
Our work will help different stakeholders of time-series forecasting understand the models' behaviors along trust (robustness) and accuracy dimensions to select an appropriate model for forecasting using our rating method, leading to improved decision-making.

https://papers.cool/arxiv/2406.12911
The Promise of Analog Deep Learning: Recent Advances, Challenges and Opportunities
2024-06-21T00:00:00+00:00
Aditya Datar, Pramit Saha
Much of the present-day Artificial Intelligence (AI) utilizes artificial neural networks, which are sophisticated computational models designed to recognize patterns and solve complex problems by learning from data. However, a major bottleneck occurs during a device's calculation of weighted sums for forward propagation and optimization procedure for backpropagation, especially for deep neural networks, or networks with numerous layers. Exploration into different methods of implementing neural networks is necessary for further advancement of the area. While a great deal of research into AI hardware exists in both directions, analog and digital implementation, many of the existing surveys lack discussion of the progress of analog deep learning. To this end, we attempt to evaluate and specify the advantages and disadvantages, along with the current progress with regard to deep learning, for analog implementations. In this paper, our focus lies on the comprehensive examination of eight distinct analog deep learning methodologies across multiple key parameters. These parameters include attained accuracy levels, application domains, algorithmic advancements, computational speed, and considerations of energy efficiency and power consumption. We also identify the neural network-based experiments implemented using these hardware devices and discuss comparative performance achieved by the different analog deep learning methods along with an analysis of their current limitations.
Overall, we find that Analog Deep Learning has great potential for future consumer-level applications, but there is still a long road ahead in terms of scalability. Most of the current implementations are more proof of concept and are not yet practically deployable for large-scale models.

https://papers.cool/arxiv/2406.12916
Opening the Black Box: predicting the trainability of deep neural networks with reconstruction entropy
2024-06-21T00:00:00+00:00
Yanick Thurn, Ro Jefferson, Johanna Erdmenger
An important challenge in machine learning is to predict the initial conditions under which a given neural network will be trainable. We present a method for predicting the trainable regime in parameter space for deep feedforward neural networks, based on reconstructing the input from subsequent activation layers via a cascade of single-layer auxiliary networks. For both MNIST and CIFAR10, we show that a single epoch of training of the shallow cascade networks is sufficient to predict the trainability of the deep feedforward network, thereby providing a significant reduction in overall training time. We achieve this by computing the relative entropy between reconstructed images and the original inputs, and show that this probe of information loss is sensitive to the phase behaviour of the network. Our results provide a concrete link between the flow of information and the trainability of deep neural networks, further elucidating the role of criticality in these systems.

https://papers.cool/arxiv/2406.12945
Under the Hood of Tabular Data Generation Models: the Strong Impact of Hyperparameter Tuning
2024-06-21T00:00:00+00:00
G. Charbel N. Kindji, Lina Maria Rojas-Barahona, Elisa Fromont, Tanguy Urvoy
We investigate the impact of dataset-specific hyperparameter, feature encoding, and architecture tuning on five recent model families for tabular data generation through an extensive benchmark on 16 datasets.
This study addresses the practical need for a unified evaluation of models that fully considers hyperparameter optimization. Additionally, we propose a reduced search space for each model that allows for quick optimization, achieving nearly equivalent performance at a significantly lower cost. Our benchmark demonstrates that, for most models, large-scale dataset-specific tuning substantially improves performance compared to the original configurations. Furthermore, we confirm that diffusion-based models generally outperform other models on tabular data. However, this advantage is not significant when the entire tuning and training process is restricted to the same GPU budget for all models.

https://papers.cool/arxiv/2406.12949
Integrating time-resolved $nrf2$ gene-expression data into a full GUTS model as a proxy for toxicodynamic damage in zebrafish embryo
2024-06-21T00:00:00+00:00
Florian Schunck, Bernhard Kodritsch, Wibke Busch, Martin Krauss, Andreas Focks
The immense production of the chemical industry requires an improved predictive risk assessment that can handle constantly evolving challenges while reducing the dependency of risk assessment on animal testing. Integrating 'omics data into mechanistic models offers a promising solution by linking cellular processes triggered after chemical exposure with observed effects in the organism. With the emerging availability of time-resolved RNA data, the goal of integrating gene expression data into mechanistic models can be approached. We propose a biologically anchored TKTD model, which describes key processes that link the gene expression level of the stress regulator $nrf2$ to detoxification and lethality by associating toxicodynamic damage with $nrf2$ expression. Fitting such a model to complex datasets consisting of multiple endpoints required the combination of methods from molecular biology, mechanistic dynamic systems modeling and Bayesian inference.
In this study we successfully integrate time-resolved gene expression data into TKTD models, and thus provide a method for assessing the influence of molecular markers on survival. This novel method was used to test whether $nrf2$ can be applied to predict lethality in zebrafish embryos. With the presented approach we outline a method to successively approach the goal of a predictive risk assessment based on molecular data.

https://papers.cool/arxiv/2406.13012
Data Plagiarism Index: Characterizing the Privacy Risk of Data-Copying in Tabular Generative Models
2024-06-21T00:00:00+00:00
Joshua Ward, Chi-Hua Wang, Guang Cheng
The promise of tabular generative models is to produce realistic synthetic data that can be shared and safely used without dangerous leakage of information from the training set. In evaluating these models, a variety of methods have been proposed to measure the tendency to copy data from the training dataset when generating a sample. However, these methods suffer from either not considering data-copying from a privacy threat perspective, not being motivated by recent results in the data-copying literature, or being difficult to make compatible with the high-dimensional, mixed-type nature of tabular data. This paper proposes a new similarity metric and Membership Inference Attack called Data Plagiarism Index (DPI) for tabular data. We show that DPI evaluates a new intuitive definition of data-copying and characterizes the corresponding privacy risk.
We show that the data-copying identified by DPI poses both privacy and fairness threats to common, high-performing architectures, underscoring the necessity for more sophisticated generative modeling techniques to mitigate this issue.

https://papers.cool/arxiv/2406.13060
Scale-Translation Equivariant Network for Oceanic Internal Solitary Wave Localization
2024-06-21T00:00:00+00:00
Zhang Wan, Shuo Wang, Xudong Zhang
Internal solitary waves (ISWs) are gravity waves that are often observed in the ocean interior rather than at the surface. They hold significant importance due to their capacity to carry substantial energy, and thus influence pollutant transport, oil platform operations, submarine navigation, etc. Researchers have studied ISWs through optical images, synthetic aperture radar (SAR) images, and altimeter data from remote sensing instruments. However, cloud cover in optical remote sensing images variably obscures ground information, leading to blurred or missing surface observations. As such, this paper aims at altimeter-based machine learning solutions to automatically locate ISWs. The challenges, however, lie in the following two aspects: 1) the altimeter data has low resolution, which requires a strong machine learner; 2) labeling data is extremely labor-intensive, leading to very limited data for training. In recent years, the rapid progress of deep learning has demonstrated strong learning capacity given abundant data. Besides, more recent studies on efficient learning and self-supervised learning have laid solid foundations to tackle the aforementioned challenges. In this paper, we propose to inject prior knowledge to achieve a strong and efficient learner. Specifically, intrinsic patterns in altimetry data are efficiently captured using a scale-translation equivariant convolutional neural network (ST-ECNN). By considering inherent symmetries in neural network design, ST-ECNN achieves higher efficiency and better performance than baseline models.
Furthermore, we also introduce prior knowledge from massive unsupervised data to enhance our solution using the SimCLR framework for pre-training. Our final solution achieves an overall better performance than baselines on our handcrafted altimetry dataset. Data and code are available at https://github.com/ZhangWan-byte/Internal_Solitary_Wave_Localization .

https://papers.cool/arxiv/2406.13074
PIPPIN: Generating variable length full events from partons
2024-06-21T00:00:00+00:00
Guillaume Quétant, John Andrew Raine, Matthew Leigh, Debajyoti Sengupta, Tobias Golling
This paper presents a novel approach for directly generating full events at detector-level from parton-level information, leveraging cutting-edge machine learning techniques. To address the challenge of multiplicity variations between parton and reconstructed object spaces, we employ transformers, score-based models and normalizing flows. Our method tackles the inherent complexities of the stochastic transition between these two spaces and achieves remarkably accurate results. The combination of innovative techniques and the achieved accuracy demonstrates the potential of our approach in advancing the field and opens avenues for further exploration. This research contributes to the ongoing efforts in high-energy physics and generative modelling, providing a promising direction for enhanced precision in fast detector simulation.

https://papers.cool/arxiv/2406.13130
Advancing Retail Data Science: Comprehensive Evaluation of Synthetic Data
2024-06-21T00:00:00+00:00
Yu Xia, Chi-Hua Wang, Joshua Mabry, Guang Cheng
The evaluation of synthetic data generation is crucial, especially in the retail sector where data accuracy is paramount. This paper introduces a comprehensive framework for assessing synthetic retail data, focusing on fidelity, utility, and privacy. Our approach differentiates between continuous and discrete data attributes, providing precise evaluation criteria.
Fidelity is measured through stability and generalizability. Stability ensures synthetic data accurately replicates known data distributions, while generalizability confirms its robustness in novel scenarios. Utility is demonstrated through the synthetic data's effectiveness in critical retail tasks such as demand forecasting and dynamic pricing, proving its value in predictive analytics and strategic planning. Privacy is safeguarded using Differential Privacy, ensuring synthetic data maintains a perfect balance between resembling training and holdout datasets without compromising security. Our findings validate that this framework provides reliable and scalable evaluation for synthetic retail data. It ensures high fidelity, utility, and privacy, making it an essential tool for advancing retail data science. This framework meets the evolving needs of the retail industry with precision and confidence, paving the way for future advancements in synthetic data methodologies.

https://papers.cool/arxiv/2406.13371
Identifiable Causal Representation Learning: Unsupervised, Multi-View, and Multi-Environment
2024-06-21T00:00:00+00:00
Julius von Kügelgen
Causal models provide rich descriptions of complex systems as sets of mechanisms by which each variable is influenced by its direct causes. They support reasoning about manipulating parts of the system and thus hold promise for addressing some of the open challenges of artificial intelligence (AI), such as planning, transferring knowledge in changing environments, or robustness to distribution shifts. However, a key obstacle to more widespread use of causal models in AI is the requirement that the relevant variables be specified a priori, which is typically not the case for the high-dimensional, unstructured data processed by modern AI systems. At the same time, machine learning (ML) has proven quite successful at automatically extracting useful and compact representations of such complex data.
Causal representation learning (CRL) aims to combine the core strengths of ML and causality by learning representations in the form of latent variables endowed with causal model semantics. In this thesis, we study and present new results for different CRL settings. A central theme is the question of identifiability: Given infinite data, when are representations satisfying the same learning objective guaranteed to be equivalent? This is an important prerequisite for CRL, as it formally characterises if and when a learning task is, at least in principle, feasible. Since learning causal models, even without a representation learning component, is notoriously difficult, we require additional assumptions on the model class or rich data beyond the classical i.i.d. setting. By partially characterising identifiability for different settings, this thesis investigates what is possible for CRL without direct supervision, and thus contributes to its theoretical foundations. Ideally, the developed insights can help inform data collection practices or inspire the design of new practical estimation methods.

https://papers.cool/arxiv/2406.13493
In-Context In-Context Learning with Transformer Neural Processes
2024-06-21T00:00:00+00:00
Matthew Ashman, Cristiana Diaconu, Adrian Weller, Richard E. Turner
Neural processes (NPs) are a powerful family of meta-learning models that seek to approximate the posterior predictive map of the ground-truth stochastic process from which each dataset in a meta-dataset is sampled. There are many cases in which practitioners, besides having access to the dataset of interest, may also have access to other datasets that share similarities with it. In this case, integrating these datasets into the NP can improve predictions. We equip NPs with this functionality and describe this paradigm as in-context in-context learning.
Standard NP architectures, such as the convolutional conditional NP (ConvCNP) or the family of transformer neural processes (TNPs), are not capable of in-context in-context learning, as they are only able to condition on a single dataset. We address this shortcoming by developing the in-context in-context learning pseudo-token TNP (ICICL-TNP). The ICICL-TNP builds on the family of PT-TNPs, which utilise pseudo-token-based transformer architectures to sidestep the quadratic computational complexity associated with regular transformer architectures. Importantly, the ICICL-TNP is capable of conditioning on both sets of datapoints and sets of datasets, enabling it to perform in-context in-context learning. We demonstrate the importance of in-context in-context learning and the effectiveness of the ICICL-TNP in a number of experiments.

https://papers.cool/arxiv/2406.13668
Improved bounds for calibration via stronger sign preservation games
2024-06-21T00:00:00+00:00
Yuval Dagan, Constantinos Daskalakis, Maxwell Fishelson, Noah Golowich, Robert Kleinberg, Princewill Okoroafor
A set of probabilistic forecasts is calibrated if each prediction of the forecaster closely approximates the empirical distribution of outcomes on the subset of timesteps where that prediction was made. We study the fundamental problem of online calibrated forecasting of binary sequences, which was initially studied by Foster & Vohra (1998). They derived an algorithm with $O(T^{2/3})$ calibration error after $T$ time steps, and showed a lower bound of $\Omega(T^{1/2})$. These bounds remained stagnant for two decades, until Qiao & Valiant (2021) improved the lower bound to $\Omega(T^{0.528})$ by introducing a combinatorial game called sign preservation and showing that lower bounds for this game imply lower bounds for calibration. We introduce a strengthening of Qiao & Valiant's game that we call sign preservation with reuse (SPR).
We prove that the relationship between SPR and calibrated forecasting is bidirectional: not only do lower bounds for SPR translate into lower bounds for calibration, but algorithms for SPR also translate into new algorithms for calibrated forecasting. In particular, any strategy that improves the trivial upper bound for the value of the SPR game would imply a forecasting algorithm with calibration error exponent less than 2/3, improving Foster & Vohra's upper bound for the first time. Using similar ideas, we then prove a slightly stronger lower bound than that of Qiao & Valiant, namely $\Omega(T^{0.54389})$. Our lower bound is obtained by an oblivious adversary, marking the first $\omega(T^{1/2})$ calibration lower bound for oblivious adversaries.

https://papers.cool/arxiv/2406.13725
Tree-Sliced Wasserstein Distance on a System of Lines
2024-06-21T00:00:00+00:00
Viet-Hoang Tran, Trang Pham, Tho Tran, Tam Le, Tan M. Nguyen
Sliced Wasserstein (SW) distance in Optimal Transport (OT) is widely used in various applications thanks to its statistical effectiveness and computational efficiency. On the other hand, Tree Wasserstein (TW) and Tree-sliced Wasserstein (TSW) are instances of OT for probability measures whose ground cost is a tree metric. TSW also has a low computational complexity, i.e. linear in the number of edges in the tree. In particular, TSW is identical to SW when the tree is a chain. While SW is prone to loss of topological information of input measures due to relying on one-dimensional projection, TSW is more flexible and has a higher degree of freedom by choosing a tree rather than a line to alleviate the curse of dimensionality in SW. However, for practical applications, popular tree metric sampling methods are heavily built upon given supports, which limits their capacity to adapt to new supports. In this paper, we propose the Tree-Sliced Wasserstein distance on a System of Lines (TSW-SL), which brings a connection between SW and TSW.
Compared to SW and TSW, our TSW-SL benefits from the higher degree of freedom of TSW while being as suitable for dynamic settings as SW. In TSW-SL, we use a variant of the Radon Transform to project measures onto a system of lines, resulting in measures on a space with a tree metric, then leverage TW to efficiently compute distances between them. We empirically verify the advantages of TSW-SL over the traditional SW by conducting a variety of experiments on gradient flows, image style transfer, and generative models.

https://papers.cool/arxiv/2406.13762
Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis
2024-06-21T00:00:00+00:00
Rachel S. Y. Teo, Tan M. Nguyen
The remarkable success of transformers in sequence modeling tasks, spanning various applications in natural language processing and computer vision, is attributed to the critical role of self-attention. Similar to the development of most deep learning models, the construction of these attention mechanisms relies on heuristics and experience. In our work, we derive self-attention from kernel principal component analysis (kernel PCA) and show that self-attention projects its query vectors onto the principal component axes of its key matrix in a feature space. We then formulate the exact formula for the value matrix in self-attention, theoretically and empirically demonstrating that this value matrix captures the eigenvectors of the Gram matrix of the key vectors in self-attention. Leveraging our kernel PCA framework, we propose Attention with Robust Principal Components (RPC-Attention), a novel class of robust attention that is resilient to data contamination. We empirically demonstrate the advantages of RPC-Attention over softmax attention on the ImageNet-1K object classification, WikiText-103 language modeling, and ADE20K image segmentation tasks.

https://papers.cool/arxiv/2406.13770
Elliptical Attention
2024-06-21T00:00:00+00:00
Stefan K. Nielsen, Laziz U. Abdullaev, Rachel Teo, Tan M. Nguyen
Pairwise dot-product self-attention is key to the success of transformers that achieve state-of-the-art performance across a variety of applications in language and vision. This dot-product self-attention computes attention weights among the input tokens using Euclidean distance, which makes the model prone to representation collapse and vulnerable to contaminated samples. In this paper, we propose using a Mahalanobis distance metric for computing the attention weights to stretch the underlying feature space in directions of high contextual relevance. In particular, we define a hyper-ellipsoidal neighborhood around each query to increase the attention weights of the tokens lying in the contextually important directions. We term this novel class of attention Elliptical Attention. Our Elliptical Attention provides two benefits: 1) reducing representation collapse and 2) enhancing the model's robustness as the Elliptical Attention pays more attention to contextually relevant information rather than focusing on some small subset of informative features. We empirically demonstrate the advantages of Elliptical Attention over the baseline dot-product attention and state-of-the-art attention methods on various practical tasks, including object classification, image segmentation, and language modeling across different data modalities.

https://papers.cool/arxiv/2406.13781
A Primal-Dual Framework for Transformers and Neural Networks
2024-06-21T00:00:00+00:00
Tan M. Nguyen, Tam Nguyen, Nhat Ho, Andrea L. Bertozzi, Richard G. Baraniuk, Stanley J. Osher
Self-attention is key to the remarkable success of transformers in sequence modeling tasks including many applications in natural language processing and computer vision. Like neural network layers, these attention mechanisms are often developed by heuristics and experience.
To provide a principled framework for constructing attention layers in transformers, we show that the self-attention corresponds to the support vector expansion derived from a support vector regression problem, whose primal formulation has the form of a neural network layer. Using our framework, we derive popular attention layers used in practice and propose two new attentions: 1) the Batch Normalized Attention (Attention-BN) derived from the batch normalization layer and 2) the Attention with Scaled Head (Attention-SH) derived from using less training data to fit the SVR model. We empirically demonstrate the advantages of the Attention-BN and Attention-SH in reducing head redundancy, increasing the model's accuracy, and improving the model's efficiency in a variety of practical applications including image and time-series classification.

https://papers.cool/arxiv/2406.13822
Association of neighborhood disadvantage with cognitive function and cortical disorganization in an unimpaired cohort
2024-06-21T00:00:00+00:00
Apoorva Safai, Erin Jonaitis, Rebecca E Langhough, William R Buckingham, Sterling C. Johnson, W. Ryan Powell, Amy J. H. Kind, Barbara B. Bendlin, Pallavi Tiwari
Neighborhood disadvantage is associated with worse health and cognitive outcomes. Morphological similarity network (MSN) is a promising approach to elucidate cortical network patterns underlying complex cognitive functions. We hypothesized that MSNs could capture changes in cortical patterns related to neighborhood disadvantage and cognitive function. This cross-sectional study included cognitively unimpaired participants from two large Alzheimer's studies at the University of Wisconsin-Madison. Neighborhood disadvantage status was obtained using the Area Deprivation Index (ADI). Cognitive performance was assessed on memory, processing speed and executive function.
Morphological Similarity Networks (MSN) were constructed for each participant based on the similarity in distribution of cortical thickness of brain regions, followed by computation of local and global network features. Associations of ADI with cognitive scores and MSN features were examined using linear regression and mediation analysis. ADI showed negative association with category fluency, implicit learning speed, story recall and modified pre-clinical Alzheimer's cognitive composite scores, indicating worse cognitive function among those living in more disadvantaged neighborhoods. Local network features of frontal and temporal regions differed based on ADI status. Centrality of left lateral orbitofrontal region showed a partial mediating effect between association of neighborhood disadvantage and story recall performance. Our preliminary findings suggest differences in local cortical organization by neighborhood disadvantage, which partially mediated the relationship between ADI and cognitive performance, providing a possible network-based mechanism to explain, in part, the risk for poor cognitive functioning associated with disadvantaged neighborhoods.

https://papers.cool/arxiv/2406.13826
Testing identification in mediation and dynamic treatment models
2024-06-21T00:00:00+00:00
Martin Huber, Kevin Kloiber, Lukas Laffers
We propose a test for the identification of causal effects in mediation and dynamic treatment models that is based on two sets of observed variables, namely covariates to be controlled for and suspected instruments, building on the test by Huber and Kueck (2022) for single treatment models. We consider models with a sequential assignment of a treatment and a mediator to assess the direct treatment effect (net of the mediator), the indirect treatment effect (via the mediator), or the joint effect of both treatment and mediator. We establish testable conditions for identifying such effects in observational data.
These conditions jointly imply (1) the exogeneity of the treatment and the mediator conditional on covariates and (2) the validity of distinct instruments for the treatment and the mediator, meaning that the instruments do not directly affect the outcome (other than through the treatment or mediator) and are unconfounded given the covariates. Our framework extends to post-treatment sample selection or attrition problems when replacing the mediator by a selection indicator for observing the outcome, enabling joint testing of the selectivity of treatment and attrition. We propose a machine learning-based test to control for covariates in a data-driven manner and analyze its finite sample performance in a simulation study. Additionally, we apply our method to Slovak labor market data and find that our testable implications are not rejected for a sequence of training programs typically considered in dynamic treatment evaluations.

https://papers.cool/arxiv/2406.13966
Causal Inference with Latent Variables: Recent Advances and Future Prospectives
2024-06-21T00:00:00+00:00
Yaochen Zhu, Yinhan He, Jing Ma, Mengxuan Hu, Sheng Li, Jundong Li
Causality lays the foundation for the trajectory of our world. Causal inference (CI), which aims to infer intrinsic causal relations among variables of interest, has emerged as a crucial research topic. Nevertheless, the lack of observation of important variables (e.g., confounders, mediators, exogenous variables, etc.) severely compromises the reliability of CI methods. The issue may arise from the inherent difficulty in measuring the variables. Additionally, in observational studies where variables are passively recorded, certain covariates might be inadvertently omitted by the experimenter.
Depending on the type of unobserved variables and the specific CI task, various consequences can be incurred if these latent variables are carelessly handled, such as biased estimation of causal effects, incomplete understanding of causal mechanisms, lack of individual-level causal consideration, etc. In this survey, we provide a comprehensive review of recent developments in CI with latent variables. We start by discussing traditional CI techniques when variables of interest are assumed to be fully observed. Afterward, under the taxonomy of circumvention and inference-based methods, we provide an in-depth discussion of various CI strategies to handle latent variables, covering the tasks of causal effect estimation, mediation analysis, counterfactual reasoning, and causal discovery. Furthermore, we generalize the discussion to graph data where interference among units may exist. Finally, we offer fresh aspects for further advancement of CI with latent variables, especially new opportunities in the era of large language models (LLMs).

https://papers.cool/arxiv/2406.14026
Demystifying Forgetting in Language Model Fine-Tuning with Statistical Analysis of Example Associations
2024-06-21T00:00:00+00:00
Xisen Jin, Xiang Ren
Language models (LMs) are known to suffer from forgetting of previously learned examples when fine-tuned, breaking stability of deployed LM systems. Despite efforts on mitigating forgetting, few have investigated whether and how forgotten upstream examples are associated with newly learned tasks. Insights on such associations enable efficient and targeted mitigation of forgetting. In this paper, we empirically analyze forgetting that occurs in $N$ upstream examples while the model learns $M$ new tasks and visualize their associations with an $M \times N$ matrix. We empirically demonstrate that the degree of forgetting can often be approximated by simple multiplicative contributions of the upstream examples and newly learned tasks.
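As an illustration of what a "simple multiplicative" approximation of an $M \times N$ forgetting matrix can look like, the sketch below fits one factor per task and one factor per upstream example so that their product approximates each matrix entry. This is an assumption-laden toy, not the authors' procedure: the function name `fit_multiplicative`, the alternating least-squares fit, and the numbers are all hypothetical.

```python
def fit_multiplicative(z, iters=50):
    """Fit z[i][j] ~ a[i] * b[j] by alternating least squares.

    z is an M x N list of lists (rows: new tasks, columns: upstream
    examples). Returns per-task factors a and per-example factors b.
    This is a generic rank-1 fit, used here only for illustration.
    """
    m, n = len(z), len(z[0])
    a = [1.0] * m
    b = [1.0] * n
    for _ in range(iters):
        # Best a for fixed b (least squares, row by row).
        sb = sum(v * v for v in b) or 1.0
        a = [sum(z[i][j] * b[j] for j in range(n)) / sb for i in range(m)]
        # Best b for fixed a (least squares, column by column).
        sa = sum(v * v for v in a) or 1.0
        b = [sum(z[i][j] * a[i] for i in range(m)) / sa for j in range(n)]
    return a, b


# Toy data: forgetting generated exactly as task_effect * example_effect.
task_effect = [0.5, 1.0, 2.0]            # hypothetical, M = 3 tasks
example_effect = [0.1, 0.2, 0.3, 0.4]    # hypothetical, N = 4 examples
z = [[t * e for e in example_effect] for t in task_effect]

a, b = fit_multiplicative(z)
max_err = max(
    abs(a[i] * b[j] - z[i][j]) for i in range(len(z)) for j in range(len(z[0]))
)
print(f"max reconstruction error: {max_err:.2e}")
```

On real measurements the residual of such a fit would indicate how much of the forgetting pattern is not captured by purely multiplicative effects.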
We also reveal more complicated patterns where specific subsets of examples are forgotten with statistics and visualization. Following our analysis, we predict forgetting that happens on upstream examples when learning a new task with matrix completion over the empirical associations, outperforming prior approaches that rely on trainable LMs. Project website: https://inklab.usc.edu/lm-forgetting-prediction/https://papers.cool/arxiv/2406.14059Tracking solutions of time-varying variational inequalities2024-06-21T00:00:00+00:00Hédi HadijiSarah SachsCristóbal GuzmánTracking the solution of time-varying variational inequalities is an important problem with applications in game theory, optimization, and machine learning. Existing work considers time-varying games or time-varying optimization problems. For strongly convex optimization problems or strongly monotone games, these results provide tracking guarantees under the assumption that the variation of the time-varying problem is restrained, that is, problems with a sublinear solution path. In this work we extend existing results in two ways: In our first result, we provide tracking bounds for (1) variational inequalities with a sublinear solution path but not necessarily monotone functions, and (2) for periodic time-varying variational inequalities that do not necessarily have a sublinear solution path-length. Our second main contribution is an extensive study of the convergence behavior and trajectory of discrete dynamical systems of periodic time-varying VI. We show that these systems can exhibit provably chaotic behavior or can converge to the solution. 
Finally, we illustrate our theoretical results with experiments.https://papers.cool/arxiv/2406.14062An agent-based model of behaviour change calibrated to reversal learning data2024-06-21T00:00:00+00:00Roben Delos ReyesHugo Lyons KeenanCameron ZachresonBehaviour change lies at the heart of many observable collective phenomena such as the transmission and control of infectious diseases, adoption of public health policies, and migration of animals to new habitats. Representing the process of individual behaviour change in computer simulations of these phenomena remains an open challenge. Often, computational models use phenomenological implementations with limited support from behavioural data. Without a strong connection to observable quantities, such models have limited utility for simulating observed and counterfactual scenarios of emergent phenomena because they cannot be validated or calibrated. Here, we present a simple stochastic individual-based model of reversal learning that captures fundamental properties of individual behaviour change, namely, the capacity to learn based on accumulated reward signals, and the transient persistence of learned behaviour after rewards are removed or altered. The model has only two parameters, and we use approximate Bayesian computation to demonstrate that they are fully identifiable from empirical reversal learning time series data. Finally, we demonstrate how the model can be extended to account for the increased complexity of behavioural dynamics over longer time scales involving fluctuating stimuli. This work is a step towards the development and evaluation of fully identifiable individual-level behaviour change models that can function as validated submodels for complex simulations of collective behaviour change.https://papers.cool/arxiv/2406.14163A Unified Statistical And Computational Framework For Ex-Post Harmonisation Of Aggregate Statistics2024-06-21T00:00:00+00:00Cynthia A. 
HuangEx-post harmonisation is one of many data preprocessing processes used to combine the increasingly vast and diverse sources of data available for research and analysis. Documenting provenance and ensuring the quality of multi-source datasets is vital for ensuring trustworthy scientific research and encouraging reuse of existing harmonisation efforts. However, capturing and communicating statistically relevant properties of harmonised datasets is difficult without a universal standard for describing harmonisation operations. Our paper combines mathematical and computer science perspectives to address this need. The Crossmaps Framework defines a new approach for transforming existing variables collected under a specific measurement or classification standard to an imputed counterfactual variable indexed by some target standard. It uses computational graphs to separate intended transformation logic from actual data transformations, and avoid the risk of syntactically valid data manipulation scripts resulting in statistically questionable data. In this paper, we introduce the Crossmaps Framework through the example of ex-post harmonisation of aggregated statistics in the social sciences. We define a new provenance task abstraction, the crossmap transform, and formalise two associated objects, the shared mass array and the crossmap. We further define graph, matrix and list encodings of crossmaps and discuss resulting implications for understanding statistical properties of ex-post harmonisation and designing error minimising workflows.https://papers.cool/arxiv/2406.14246Non-Negative Universal Differential Equations With Applications in Systems Biology2024-06-21T00:00:00+00:00Maren PhilippsAntonia KörnerJakob VanhoeferDilan PathiranaJan HasenauerUniversal differential equations (UDEs) leverage the respective advantages of mechanistic models and artificial neural networks and combine them into one dynamic model. 
However, these hybrid models can suffer from unrealistic solutions, such as negative values for biochemical quantities. We present non-negative UDE (nUDEs), a constrained UDE variant that guarantees non-negative values. Furthermore, we explore regularisation techniques to improve generalisation and interpretability of UDEs.https://papers.cool/arxiv/2406.14347$\nabla^2$DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials2024-06-21T00:00:00+00:00Kuzma KhrabrovAnton BerArtem TsypinKonstantin UsheninEgor RumiantsevAlexander TelepovDmitry ProtasovIlya ShenbinAnton AlekseevMikhail ShirokikhSergey NikolenkoElena TutubalinaArtur KadurinMethods of computational quantum chemistry provide accurate approximations of molecular properties crucial for computer-aided drug discovery and other areas of chemical science. However, high computational complexity limits the scalability of their applications. Neural network potentials (NNPs) are a promising alternative to quantum chemistry methods, but they require large and diverse datasets for training. This work presents a new dataset and benchmark called $\nabla^2$DFT that is based on nablaDFT. It contains twice as many molecular structures, three times more conformations, new data types and tasks, and state-of-the-art models. The dataset includes energies, forces, 17 molecular properties, Hamiltonian and overlap matrices, and a wavefunction object. All calculations were performed at the DFT level ($\omega$B97X-D/def2-SVP) for each conformation. Moreover, $\nabla^2$DFT is the first dataset that contains relaxation trajectories for a substantial number of drug-like molecules. We also introduce a novel benchmark for evaluating NNPs in molecular property prediction, Hamiltonian prediction, and conformational optimization tasks.
Finally, we propose an extendable framework for training NNPs and implement 10 models within it.https://papers.cool/arxiv/2406.14380Estimating Treatment Effects under Recommender Interference: A Structured Neural Networks Approach2024-06-21T00:00:00+00:00Ruohan ZhanShichao HanYuchen HuZhenling JiangRecommender systems are essential for content-sharing platforms by curating personalized content. To evaluate updates of recommender systems targeting content creators, platforms frequently engage in creator-side randomized experiments to estimate the treatment effect, defined as the difference in outcomes when a new (vs. the status quo) algorithm is deployed on the platform. We show that the standard difference-in-means estimator can lead to a biased treatment effect estimate. This bias arises because of recommender interference, which occurs when treated and control creators compete for exposure through the recommender system. We propose a "recommender choice model" that captures how an item is chosen among a pool comprised of both treated and control content items. By combining a structural choice model with neural networks, the framework directly models the interference pathway in a microfounded way while accounting for rich viewer-content heterogeneity. Using the model, we construct a double/debiased estimator of the treatment effect that is consistent and asymptotically normal. We demonstrate its empirical performance with a field experiment on the Weixin short-video platform: besides the standard creator-side experiment, we carry out a costly blocked double-sided randomization design to obtain a benchmark estimate without interference bias.
We show that the proposed estimator significantly reduces the bias in treatment effect estimates compared to the standard difference-in-means estimator.https://papers.cool/arxiv/2406.14399WEATHER-5K: A Large-scale Global Station Weather Dataset Towards Comprehensive Time-series Forecasting Benchmark2024-06-21T00:00:00+00:00Tao HanSong GuoZhenghao ChenWanghan XuLei BaiGlobal Station Weather Forecasting (GSWF) is crucial for various sectors, including aviation, agriculture, energy, and disaster preparedness. Recent advancements in deep learning have significantly improved the accuracy of weather predictions by optimizing models based on public meteorological data. However, existing public datasets for GSWF optimization and benchmarking still suffer from significant limitations, such as small sizes, limited temporal coverage, and a lack of comprehensive variables. These shortcomings prevent them from effectively reflecting the benchmarks of current forecasting methods and fail to support the real needs of operational weather forecasting. To address these challenges, we present the WEATHER-5K dataset. This dataset comprises a comprehensive collection of data from 5,672 weather stations worldwide, spanning a 10-year period with one-hour intervals. It includes multiple crucial weather elements, providing a more reliable and interpretable resource for forecasting. Furthermore, our WEATHER-5K dataset can serve as a benchmark for comprehensively evaluating existing well-known forecasting models, extending beyond GSWF methods to support future time-series research challenges and opportunities. 
The dataset and benchmark implementation are publicly available at: https://github.com/taohan10200/WEATHER-5K.https://papers.cool/arxiv/2406.14469Fusion of Movement and Naive Predictions for Point Forecasting in Univariate Random Walks2024-06-21T00:00:00+00:00Cheng ZhangTraditional methods for point forecasting in univariate random walks often fail to surpass naive benchmarks due to data unpredictability. This study introduces a novel forecasting method that fuses movement prediction (binary classification) with naive forecasts for accurate one-step-ahead point forecasting. The method's efficacy is demonstrated through theoretical analysis, simulations, and real-world data experiments. It reliably exceeds naive forecasts with movement prediction accuracies as low as 0.55, outperforming baseline models like ARIMA, linear regression, MLP, and LSTM networks in forecasting the S&P 500 index and Bitcoin prices. This method is particularly advantageous when accurate point predictions are challenging but accurate movement predictions are attainable, translating movement predictions into point forecasts in random walk contexts.https://papers.cool/arxiv/2406.13036Sharp detection of low-dimensional structure in probability measures via dimensional logarithmic Sobolev inequalities2024-06-21T00:00:00+00:00Matthew T. C. LiTiangang CuiFengyi LiYoussef MarzoukOlivier ZahmIdentifying low-dimensional structure in high-dimensional probability measures is an essential pre-processing step for efficient sampling. We introduce a method for identifying and approximating a target measure $\pi$ as a perturbation of a given reference measure $\mu$ along a few significant directions of $\mathbb{R}^{d}$. The reference measure can be a Gaussian or a nonlinear transformation of a Gaussian, as commonly arising in generative modeling. Our method extends prior work on minimizing majorizations of the Kullback-Leibler divergence to identify optimal approximations within this class of measures.
Our main contribution unveils a connection between the dimensional logarithmic Sobolev inequality (LSI) and approximations with this ansatz. Specifically, when the target and reference are both Gaussian, we show that minimizing the dimensional LSI is equivalent to minimizing the KL divergence restricted to this ansatz. For general non-Gaussian measures, the dimensional LSI produces majorants that uniformly improve on previous majorants for gradient-based dimension reduction. We further demonstrate the applicability of this analysis to the squared Hellinger distance, where analogous reasoning shows that the dimensional Poincaré inequality offers improved bounds.https://papers.cool/arxiv/2406.13052Distance Covariance, Independence, and Pairwise Differences2024-06-21T00:00:00+00:00Jakob RaymaekersPeter J. Rousseeuw(To appear in The American Statistician.) Distance covariance (Székely, Rizzo, and Bakirov, 2007) is a fascinating recent notion, which is popular as a test for dependence of any type between random variables $X$ and $Y$. This approach deserves to be touched upon in modern courses on mathematical statistics. It makes use of distances of the type $|X-X'|$ and $|Y-Y'|$, where $(X',Y')$ is an independent copy of $(X,Y)$. This raises natural questions about independence of variables like $X-X'$ and $Y-Y'$, about the connection between Cov$(|X-X'|,|Y-Y'|)$ and the covariance between doubly centered distances, and about necessary and sufficient conditions for independence. We show some basic results and present a new and nontechnical counterexample to a common fallacy, which provides more insight.
We also show some motivating examples involving bivariate distributions and contingency tables, which can be used as didactic material for introducing distance correlation.https://papers.cool/arxiv/2406.13111Nonparametric Motion Control in Functional Connectivity Studies in Children with Autism Spectrum Disorder2024-06-21T00:00:00+00:00Jialu RanSarah ShultzBenjamin B. RiskDavid BenkeserAutism Spectrum Disorder (ASD) is a neurodevelopmental condition associated with difficulties with social interactions, communication, and restricted or repetitive behaviors. To characterize ASD, investigators often use functional connectivity derived from resting-state functional magnetic resonance imaging of the brain. However, participants' head motion during the scanning session can induce motion artifacts. Many studies remove scans with excessive motion, which can lead to drastic reductions in sample size and introduce selection bias. To avoid such exclusions, we propose an estimand inspired by causal inference methods that quantifies the difference in average functional connectivity in autistic and non-ASD children while standardizing motion relative to the low motion distribution in scans that pass motion quality control. We introduce a nonparametric estimator for motion control, called MoCo, that uses all participants and flexibly models the impacts of motion and other relevant features using an ensemble of machine learning methods. We establish large-sample efficiency and multiple robustness of our proposed estimator. The framework is applied to estimate the difference in functional connectivity between 132 autistic and 245 non-ASD children, of which 34 and 126 pass motion quality control. 
MoCo appears to dramatically reduce motion artifacts relative to no participant removal, while more efficiently utilizing participant data and accounting for possible selection biases relative to the naïve approach with participant removal.https://papers.cool/arxiv/2406.13151von Mises Quasi-Processes for Bayesian Circular Regression2024-06-21T00:00:00+00:00Yarden CohenAlexandre Khae Wu NavarroJes FrellsenRichard E. TurnerRaziel RiemerAri PakmanThe need for regression models to predict circular values arises in many scientific fields. In this work we explore a family of expressive and interpretable distributions over circle-valued random functions related to Gaussian processes targeting two Euclidean dimensions conditioned on the unit circle. The resulting probability model has connections with continuous spin models in statistical physics. Moreover, its density is very simple and has maximum entropy, unlike previous Gaussian process-based approaches, which use wrapping or radial marginalization. For posterior inference, we introduce a new Stratonovich-like augmentation that lends itself to fast Markov Chain Monte Carlo sampling. We argue that transductive learning in these models favors a Bayesian approach to the parameters. We present experiments applying this model to the prediction of (i) wind directions and (ii) the percentage of the running gait cycle as a function of joint angles.https://papers.cool/arxiv/2406.13154Conditional score-based diffusion models for solving inverse problems in mechanics2024-06-21T00:00:00+00:00Agnimitra DasguptaHarisankar RamaswamyJavier Murgoitio EsandiKen FooRunze LiQifa ZhouBrendan KennedyAssad OberaiWe propose a framework to perform Bayesian inference using conditional score-based diffusion models to solve a class of inverse problems in mechanics involving the inference of a specimen's spatially varying material properties from noisy measurements of its mechanical response to loading.
Conditional score-based diffusion models are generative models that learn to approximate the score function of a conditional distribution using samples from the joint distribution. More specifically, the score functions corresponding to multiple realizations of the measurement are approximated using a single neural network, the so-called score network, which is subsequently used to sample the posterior distribution using an appropriate Markov chain Monte Carlo scheme based on Langevin dynamics. Training the score network only requires simulating the forward model. Hence, the proposed approach can accommodate black-box forward models and complex measurement noise. Moreover, once the score network has been trained, it can be re-used to solve the inverse problem for different realizations of the measurements. We demonstrate the efficacy of the proposed approach on a suite of high-dimensional inverse problems in mechanics that involve inferring heterogeneous material properties from noisy measurements. Some examples we consider involve synthetic data, while others include data collected from actual elastography experiments. Further, our applications demonstrate that the proposed approach can handle different measurement modalities, complex patterns in the inferred quantities, non-Gaussian and non-additive noise models, and nonlinear black-box forward models. The results show that the proposed framework can solve large-scale physics-based inverse problems efficiently.https://papers.cool/arxiv/2406.13197Representation Transfer Learning for Semiparametric Regression2024-06-21T00:00:00+00:00Baihua HeHuihang LiuXinyu ZhangJian HuangWe propose a transfer learning method that utilizes data representations in a semiparametric regression model. Our aim is to perform statistical inference on the parameter of primary interest in the target model while accounting for potential nonlinear effects of confounding variables. 
We leverage knowledge from source domains, assuming that the sample size of the source data is substantially larger than that of the target data. This knowledge transfer is carried out by the sharing of data representations, predicated on the idea that there exists a set of latent representations transferable from the source to the target domain. We address model heterogeneity between the source and target domains by incorporating domain-specific parameters in their respective models. We establish sufficient conditions for the identifiability of the models and demonstrate that the estimator for the primary parameter in the target model is both consistent and asymptotically normal. These results lay the theoretical groundwork for making statistical inferences about the main effects. Our simulation studies highlight the benefits of our method, and we further illustrate its practical applications using real-world data.https://papers.cool/arxiv/2406.13310A finite-infinite shared atoms nested model for the Bayesian analysis of large grouped data2024-06-21T00:00:00+00:00Laura D'AngeloFrancesco DentiThe use of hierarchical mixture priors with shared atoms has recently flourished in the Bayesian literature for partially exchangeable data. Leveraging on nested levels of mixtures, these models allow the estimation of a two-layered data partition: across groups and across observations. This paper discusses and compares the properties of such modeling strategies when the mixing weights are assigned either a finite-dimensional Dirichlet distribution or a Dirichlet process prior. Based on these considerations, we introduce a novel hierarchical nonparametric prior based on a finite set of shared atoms, a specification that enhances the flexibility of the induced random measures and the availability of fast posterior inference. To support these findings, we analytically derive the induced prior correlation structure and partially exchangeable partition probability function. 
Additionally, we develop a novel mean-field variational algorithm for posterior inference to boost the applicability of our nested model to large multivariate data. We then assess and compare the performance of the different shared-atom specifications via simulation. We also show that our variational proposal is highly scalable and that the accuracy of the posterior density estimate and the estimated partition is comparable with state-of-the-art Gibbs sampler algorithms. Finally, we apply our model to a real dataset of Spotify's song features, simultaneously segmenting artists and songs with similar characteristics.https://papers.cool/arxiv/2406.13425Coupled Input-Output Dimension Reduction: Application to Goal-oriented Bayesian Experimental Design and Global Sensitivity Analysis2024-06-21T00:00:00+00:00Qiao ChenElise ArnaudRicardo BaptistaOlivier ZahmWe introduce a new method to jointly reduce the dimension of the input and output space of a high-dimensional function. Choosing a reduced input subspace influences which output subspace is relevant and vice versa. Conventional methods focus on reducing either the input or output space, even though both are often reduced simultaneously in practice. Our coupled approach naturally supports goal-oriented dimension reduction, where either an input or output quantity of interest is prescribed. We consider, in particular, goal-oriented sensor placement and goal-oriented sensitivity analysis, which can be viewed as dimension reduction where the most important output or, respectively, input components are chosen. Both applications present difficult combinatorial optimization problems with expensive objectives such as the expected information gain and Sobol indices. 
By optimizing gradient-based bounds, we can determine the most informative sensors and most sensitive parameters as the largest diagonal entries of some diagnostic matrices, thus bypassing the combinatorial optimization and objective evaluation.https://papers.cool/arxiv/2406.13447High-probability minimax lower bounds2024-06-21T00:00:00+00:00Tianyi MaKabir A. VerchandRichard J. SamworthThe minimax risk is often considered as a gold standard against which we can compare specific statistical procedures. Nevertheless, as has been observed recently in robust and heavy-tailed estimation problems, the inherent reduction of the (random) loss to its expectation may entail a significant loss of information regarding its tail behaviour. In an attempt to avoid such a loss, we introduce the notion of a minimax quantile, and seek to articulate its dependence on the quantile level. To this end, we develop high-probability variants of the classical Le Cam and Fano methods, as well as a technique to convert local minimax risk lower bounds to lower bounds on minimax quantiles. To illustrate the power of our framework, we deploy our techniques on several examples, recovering recent results in robust mean estimation and stochastic convex optimisation, as well as obtaining several new results in covariance matrix estimation, sparse linear regression, nonparametric density estimation and isotonic regression. Our overall goal is to argue that minimax quantiles can provide a finer-grained understanding of the difficulty of statistical problems, and that, in wide generality, lower bounds on these quantities can be obtained via user-friendly tools.https://papers.cool/arxiv/2406.13478Semiparametric Localized Principal Stratification Analysis with Continuous Strata2024-06-21T00:00:00+00:00Yichi ZhangShu YangPrincipal stratification is essential for revealing causal mechanisms involving post-treatment intermediate variables. 
Principal stratification analysis with continuous intermediate variables is increasingly common but challenging due to the infinite principal strata and the nonidentifiability and nonregularity of principal causal effects. Inspired by recent research, we resolve these challenges by first using a flexible copula-based principal score model to identify principal causal effect under weak principal ignorability. We then target the local functional substitute of principal causal effect, which is statistically regular and can accurately approximate principal causal effect with vanishing bandwidth. We simplify the full efficient influence function of the local functional substitute by considering its oracle-scenario alternative. This leads to a computationally efficient and straightforward estimator for the local functional substitute and principal causal effect with vanishing bandwidth. We prove the double robustness and statistical optimality of our proposed estimator, and derive its asymptotic normality for inferential purposes. We illustrate the appealing statistical performance of our proposed estimator in simulations, and apply it to two real datasets with intriguing scientific discoveries.https://papers.cool/arxiv/2406.13488Approximately Equivariant Neural Processes2024-06-21T00:00:00+00:00Matthew AshmanCristiana DiaconuAdrian WellerWessel BruinsmaRichard E. TurnerEquivariant deep learning architectures exploit symmetries in learning problems to improve the sample efficiency of neural-network-based models and their ability to generalise. However, when modelling real-world data, learning problems are often not exactly equivariant, but only approximately. For example, when estimating the global temperature field from weather station observations, local topographical features like mountains break translation equivariance. In these scenarios, it is desirable to construct architectures that can flexibly depart from exact equivariance in a data-driven way. 
In this paper, we develop a general approach to achieving this using existing equivariant architectures. Our approach is agnostic to both the choice of symmetry group and model architecture, making it widely applicable. We consider the use of approximately equivariant architectures in neural processes (NPs), a popular family of meta-learning models. We demonstrate the effectiveness of our approach on a number of synthetic and real-world regression experiments, demonstrating that approximately equivariant NP models can outperform both their non-equivariant and strictly equivariant counterparts.https://papers.cool/arxiv/2406.13500Gradient-Boosted Generalized Linear Models for Conditional Vine Copulas2024-06-21T00:00:00+00:00David JobstAnnette MöllerJürgen GroßVine copulas are flexible dependence models using bivariate copulas as building blocks. If the parameters of the bivariate copulas in the vine copula depend on covariates, one obtains a conditional vine copula. We propose an extension for the estimation of continuous conditional vine copulas, where the parameters of continuous conditional bivariate copulas are estimated sequentially and separately via gradient-boosting. For this purpose, we link covariates via generalized linear models (GLMs) to Kendall's $\tau$ correlation coefficient from which the corresponding copula parameter can be obtained. Consequently, the gradient-boosting algorithm estimates the copula parameters providing a natural covariate selection. In a second step, an additional covariate deselection procedure is applied. The performance of the gradient-boosted conditional vine copulas is illustrated in a simulation study. Linear covariate effects in low- and high-dimensional settings are investigated for the conditional bivariate copulas separately and for conditional vine copulas. Moreover, the gradient-boosted conditional vine copulas are applied to the temporal postprocessing of ensemble weather forecasts in a low-dimensional setting. 
The results show that our suggested method is able to outperform the benchmark methods and identifies temporal correlations better. Finally, we provide an R package called boostCopula for this method.https://papers.cool/arxiv/2406.13513Sharp oracle inequalities and universality of the AIC and FPE2024-06-21T00:00:00+00:00Moritz JirakGeorg KöstenbergerIn two landmark papers, Akaike introduced the AIC and FPE, demonstrating their significant usefulness for prediction. In subsequent seminal works, Shibata developed a notion of asymptotic efficiency and showed that both AIC and FPE are optimal, setting the stage for decades-long developments and research in this area and beyond. Conceptually, the theory of efficiency is universal in the sense that it (formally) only relies on second-order properties of the underlying process $(X_t)_{t\in \mathbb{Z}}$, but, so far, almost all (efficiency) results require the much stronger assumption of a linear process with independent innovations. In this work, we establish sharp oracle inequalities subject only to a very general notion of weak dependence, establishing a universal property of the AIC and FPE. A direct corollary of our inequalities is asymptotic efficiency of these criteria. Our framework contains many prominent dynamical systems such as random walks on the regular group, functionals of iterated random systems, functionals of (augmented) GARCH models of any order, functionals of (Banach space valued) linear processes, possibly infinite memory Markov chains, dynamical systems arising from SDEs, and many more.https://papers.cool/arxiv/2406.13619Generative Modeling by Minimizing the Wasserstein-2 Loss2024-06-21T00:00:00+00:00Yu-Jui HuangZachariah MalikThis paper approaches the unsupervised learning problem by minimizing the second-order Wasserstein loss (the $W_2$ loss).
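For reference, the standard definition of the second-order Wasserstein distance between two probability measures $\mu$ and $\nu$ on $\mathbb{R}^d$ with finite second moments is given below. The symbols $\mu$, $\nu$, $\gamma$, and $\Pi(\mu,\nu)$ are notation introduced here for illustration, and the abstract does not state whether the loss is the distance itself or its square.

$$
W_2(\mu,\nu) = \left( \inf_{\gamma \in \Pi(\mu,\nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} |x-y|^2 \, d\gamma(x,y) \right)^{1/2},
$$

where $\Pi(\mu,\nu)$ denotes the set of couplings of $\mu$ and $\nu$, that is, joint distributions on $\mathbb{R}^d \times \mathbb{R}^d$ whose marginals are $\mu$ and $\nu$.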
The minimization is characterized by a distribution-dependent ordinary differential equation (ODE), whose dynamics involves the Kantorovich potential between a current estimated distribution and the true data distribution. A main result shows that the time-marginal law of the ODE converges exponentially to the true data distribution. To prove that the ODE has a unique solution, we first construct explicitly a solution to the associated nonlinear Fokker-Planck equation and show that it coincides with the unique gradient flow for the $W_2$ loss. Based on this, a unique solution to the ODE is built from Trevisan's superposition principle and the exponential convergence results. An Euler scheme is proposed for the distribution-dependent ODE and it is shown to correctly recover the gradient flow for the $W_2$ loss in the limit. An algorithm is designed by following the scheme and applying persistent training, which is natural in our gradient-flow framework. In both low- and high-dimensional experiments, our algorithm converges much faster than and outperforms Wasserstein generative adversarial networks, by increasing the level of persistent training appropriately.https://papers.cool/arxiv/2406.13635Temporal label recovery from noisy dynamical data2024-06-21T00:00:00+00:00Yuehaw KhooXin T. TongWanjie WangYuguan WangAnalyzing dynamical data often requires information of the temporal labels, but such information is unavailable in many applications. Recovery of these temporal labels, closely related to the seriation or sequencing problem, becomes crucial in the study. However, challenges arise due to the nonlinear nature of the data and the complexity of the underlying dynamical system, which may be periodic or non-periodic. Additionally, noise within the feature space complicates the theoretical analysis. Our work develops spectral algorithms that leverage manifold learning concepts to recover temporal labels from noisy data. 
We first construct the graph Laplacian of the data, and then employ the second (and the third) Fiedler vectors to recover temporal labels. This method can be applied to both periodic and aperiodic cases. It also does not require monotone properties on the similarity matrix, which are commonly assumed in existing spectral seriation algorithms. We develop the $\ell_{\infty}$ error of our estimators for the temporal labels and ranking, without assumptions on the eigen-gap. In numerical analysis, our method outperforms spectral seriation algorithms based on a similarity matrix. The performance of our algorithms is further demonstrated on a synthetic biomolecule data example.https://papers.cool/arxiv/2406.13691Computationally efficient multi-level Gaussian process regression for functional data observed under completely or partially regular sampling designs2024-06-21T00:00:00+00:00Adam Gorm HoffmannClaus Thorn EkstrømAndreas Kryger JensenGaussian process regression is a frequently used statistical method for flexible yet fully probabilistic non-linear regression modeling. A common obstacle is its computational complexity which scales poorly with the number of observations. This is especially an issue when applying Gaussian process models to multiple functions simultaneously in various applications of functional data analysis. We consider a multi-level Gaussian process regression model where a common mean function and individual subject-specific deviations are modeled simultaneously as latent Gaussian processes. We derive exact analytic and computationally efficient expressions for the log-likelihood function and the posterior distributions in the case where the observations are sampled on either a completely or partially regular grid. This enables us to fit the model to large data sets that are currently computationally inaccessible using a standard implementation. 
We show through a simulation study that our analytic expressions are several orders of magnitude faster compared to a standard implementation, and we provide an implementation in the probabilistic programming language Stan.https://papers.cool/arxiv/2406.13814Evaluation of Missing Data Analytical Techniques in Longitudinal Research: Traditional and Machine Learning Approaches2024-06-21T00:00:00+00:00Dandan TangXin TongMissing Not at Random (MNAR) and nonnormal data are challenging to handle. Traditional missing data analytical techniques such as full information maximum likelihood estimation (FIML) may fail with nonnormal data as they are built on normal distribution assumptions. Two-Stage Robust Estimation (TSRE) does manage nonnormal data, but both FIML and TSRE are less explored in longitudinal studies under MNAR conditions with nonnormal distributions. Unlike traditional statistical approaches, machine learning approaches do not require distributional assumptions about the data. More importantly, they have shown promise for MNAR data; however, their application in longitudinal studies, addressing both Missing at Random (MAR) and MNAR scenarios, is also underexplored. This study utilizes Monte Carlo simulations to assess and compare the effectiveness of six analytical techniques for missing data within the growth curve modeling framework. These techniques include traditional approaches like FIML and TSRE, machine learning approaches by single imputation (K-Nearest Neighbors and missForest), and machine learning approaches by multiple imputation (micecart and miceForest). We investigate the influence of sample size, missing data rate, missing data mechanism, and data distribution on the accuracy and efficiency of model estimation. Our findings indicate that FIML is most effective for MNAR data among the tested approaches. 
TSRE excels in handling MAR data, while missForest is only advantageous in limited conditions with a combination of very skewed distributions, very large sample sizes (e.g., n larger than 1000), and low missing data rates.https://papers.cool/arxiv/2406.13833Cluster Quilting: Spectral Clustering for Patchwork Learning2024-06-21T00:00:00+00:00Lili ZhengAndersen ChangGenevera I. AllenPatchwork learning arises as a new and challenging data collection paradigm where both samples and features are observed in fragmented subsets. Due to technological limits, measurement expense, or multimodal data integration, such patchwork data structures are frequently seen in neuroscience, healthcare, and genomics, among others. Instead of analyzing each data patch separately, it is highly desirable to extract comprehensive knowledge from the whole data set. In this work, we focus on the clustering problem in patchwork learning, aiming at discovering clusters amongst all samples even when some are never jointly observed for any feature. We propose a novel spectral clustering method called Cluster Quilting, consisting of (i) patch ordering that exploits the overlapping structure amongst all patches, (ii) patchwise SVD, (iii) sequential linear mapping of top singular vectors for patch overlaps, followed by (iv) k-means on the combined and weighted singular vectors. Under a sub-Gaussian mixture model, we establish theoretical guarantees via a non-asymptotic misclustering rate bound that reflects both properties of the patch-wise observation regime as well as the clustering signal and noise dependencies. 
We also validate our Cluster Quilting algorithm through extensive empirical studies on both simulated and real data sets in neuroscience and genomics, where it discovers more accurate and scientifically more plausible clusters than other approaches.https://papers.cool/arxiv/2406.13836Mastering Rare Event Analysis: Optimal Subsample Size in Logistic and Cox Regressions2024-06-21T00:00:00+00:00Tal AgassiNir KeretMalka GorfineIn the realm of contemporary data analysis, the use of massive datasets has taken on heightened significance, albeit often entailing considerable demands on computational time and memory. While a multitude of existing works offer optimal subsampling methods for conducting analyses on subsamples with minimized efficiency loss, they notably lack tools for judiciously selecting the optimal subsample size. To bridge this gap, our work introduces tools designed for choosing the optimal subsample size. We focus on three settings: the Cox regression model for survival data with rare events and logistic regression for both balanced and imbalanced datasets. Additionally, we present a novel optimal subsampling procedure tailored for logistic regression with imbalanced data. The efficacy of these tools and procedures is demonstrated through an extensive simulation study and meticulous analyses of two sizable datasets.https://papers.cool/arxiv/2406.13876An Empirical Bayes Jackknife Regression Framework for Covariance Matrix Estimation2024-06-21T00:00:00+00:00Huqin XinSihai Dave ZhaoCovariance matrix estimation, a classical statistical topic, poses significant challenges when the sample size is comparable to or smaller than the number of features. In this paper, we frame covariance matrix estimation as a compound decision problem and apply an optimal decision rule to estimate covariance parameters. To approximate this rule, we introduce an algorithm that integrates jackknife techniques with machine learning regression methods. 
This algorithm exhibits adaptability across diverse scenarios without relying on assumptions about data distribution. Simulation results and gene network inference from an RNA-seq experiment in mice demonstrate that our approach either matches or surpasses several state-of-the-art methods.https://papers.cool/arxiv/2406.13906Semi-supervised Regression Analysis with Model Misspecification and High-dimensional Data2024-06-21T00:00:00+00:00Ye TianPeng WuZhiqiang TanThe accessibility of vast volumes of unlabeled data has sparked growing interest in semi-supervised learning (SSL) and covariate shift transfer learning (CSTL). In this paper, we present an inference framework for estimating regression coefficients in conditional mean models within both SSL and CSTL settings, while allowing for the misspecification of conditional mean models. We develop an augmented inverse probability weighted (AIPW) method, employing regularized calibrated estimators for both propensity score (PS) and outcome regression (OR) nuisance models, with PS and OR models being sequentially dependent. We show that when the PS model is correctly specified, the proposed estimator achieves consistency, asymptotic normality, and valid confidence intervals, even with possible OR model misspecification and high-dimensional data. Moreover, by suppressing detailed technical choices, we demonstrate that previous methods can be unified within our AIPW framework. Our theoretical findings are verified through extensive simulation studies and a real-world data application.https://papers.cool/arxiv/2406.13936Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient Methods2024-06-21T00:00:00+00:00Tim Tsz-Kit LauWeijian LiChenwei XuHan LiuMladen KolarModern deep neural networks often require distributed training with many workers due to their large size. 
As worker numbers increase, communication overheads become the main bottleneck in data-parallel minibatch stochastic gradient methods with per-iteration gradient synchronization. Local gradient methods like Local SGD reduce communication by only syncing after several local steps. Despite understanding their convergence in i.i.d. and heterogeneous settings and knowing the importance of batch sizes for efficiency and generalization, optimal local batch sizes are difficult to determine. We introduce adaptive batch size strategies for local gradient methods that increase batch sizes adaptively to reduce minibatch gradient variance. We provide convergence guarantees under homogeneous data conditions and support our claims with image classification experiments, demonstrating the effectiveness of our strategies in training and generalization.https://papers.cool/arxiv/2406.13938Coverage of Credible Sets for Regression under Variable Selection2024-06-21T00:00:00+00:00Samhita PalSubhashis GhosalWe study the asymptotic frequentist coverage of credible sets based on a novel Bayesian approach for a multiple linear regression model under variable selection. We initially ignore the issue of variable selection, which allows us to put a conjugate normal prior on the coefficient vector. The variable selection step is incorporated directly in the posterior through a sparsity-inducing map and uses the induced prior for making an inference instead of the natural conjugate posterior. The sparsity-inducing map minimizes the sum of the squared l2-distance weighted by the data matrix and a suitably scaled l1-penalty term. We obtain the limiting coverage of various credible regions and demonstrate that a modified credible interval for a component has the exact asymptotic frequentist coverage if the corresponding predictor is asymptotically uncorrelated with other predictors. 
Through extensive simulation, we provide a guideline for choosing the penalty parameter as a function of the credibility level appropriate for the corresponding coverage. We also show finite-sample numerical results that support the conclusions from the asymptotic theory. We also provide the credInt package that implements the method in R to obtain the credible intervals along with the posterior samples.https://papers.cool/arxiv/2406.13944Generalization error of min-norm interpolators in transfer learning2024-06-21T00:00:00+00:00Yanke SongSohom BhattacharyaPragya SurThis paper establishes the generalization error of pooled min-$\ell_2$-norm interpolation in transfer learning where data from diverse distributions are available. Min-norm interpolators emerge naturally as implicit regularized limits of modern machine learning algorithms. Previous work characterized their out-of-distribution risk when samples from the test distribution are unavailable during training. However, in many applications, a limited amount of test data may be available during training, yet properties of min-norm interpolation in this setting are not well-understood. We address this gap by characterizing the bias and variance of pooled min-$\ell_2$-norm interpolation under covariate and model shifts. The pooled interpolator captures both early fusion and a form of intermediate fusion. Our results have several implications: under model shift, for low signal-to-noise ratio (SNR), adding data always hurts. For higher SNR, transfer learning helps as long as the shift-to-signal (SSR) ratio lies below a threshold that we characterize explicitly. By consistently estimating these ratios, we provide a data-driven method to determine: (i) when the pooled interpolator outperforms the target-based interpolator, and (ii) the optimal number of target samples that minimizes the generalization error. 
Under covariate shift, if the source sample size is small relative to the dimension, heterogeneity between domains improves the risk, and vice versa. We establish a novel anisotropic local law to achieve these characterizations, which may be of independent interest in random matrix theory. We supplement our theoretical characterizations with comprehensive simulations that demonstrate the finite-sample efficacy of our results.https://papers.cool/arxiv/2406.13989Random pairing MLE for estimation of item parameters in Rasch model2024-06-21T00:00:00+00:00Yuepeng YangCong MaThe Rasch model, a classical model in the item response theory, is widely used in psychometrics to model the relationship between individuals' latent traits and their binary responses on assessments or questionnaires. In this paper, we introduce a new likelihood-based estimator -- random pairing maximum likelihood estimator ($\mathsf{RP\text{-}MLE}$) and its bootstrapped variant multiple random pairing MLE ($\mathsf{MRP\text{-}MLE}$) that faithfully estimate the item parameters in the Rasch model. The new estimators have several appealing features compared to existing ones. First, both work for sparse observations, an increasingly important scenario in the big data era. Second, both estimators are provably minimax optimal in terms of finite sample $\ell_{\infty}$ estimation error. Lastly, $\mathsf{RP\text{-}MLE}$ admits precise distributional characterization that allows uncertainty quantification on the item parameters, e.g., construction of confidence intervals of the item parameters. The main idea underlying $\mathsf{RP\text{-}MLE}$ and $\mathsf{MRP\text{-}MLE}$ is to randomly pair user-item responses to form item-item comparisons. This is carefully designed to reduce the problem size while retaining statistical independence. 
We also provide empirical evidence of the efficacy of the two new estimators using both simulated and real data.https://papers.cool/arxiv/2406.13995Prediction of Unobserved Bifurcation by Unsupervised Extraction of Slowly Time-Varying System Parameter Dynamics from Time Series Using Reservoir Computing2024-06-21T00:00:00+00:00Keita TokudaYuichi KatoriNonlinear and non-stationary processes are prevalent in various natural and physical phenomena, where system dynamics can change qualitatively due to bifurcation phenomena. Traditional machine learning methods have advanced our ability to learn and predict such systems from observed time series data. However, predicting the behavior of systems with temporal parameter variations without knowledge of true parameter values remains a significant challenge. This study leverages the reservoir computing framework to address this problem by unsupervised extraction of slowly varying system parameters from time series data. We propose a model architecture consisting of a slow reservoir with long timescale internal dynamics and a fast reservoir with short timescale dynamics. The slow reservoir extracts the temporal variation of system parameters, which are then used to predict unknown bifurcations in the fast dynamics. Through experiments using data generated from chaotic dynamical systems, we demonstrate the ability to predict bifurcations not present in the training data. Our approach shows potential for applications in fields such as neuroscience, material science, and weather prediction, where slow dynamics influencing qualitative changes are often unobservable.https://papers.cool/arxiv/2406.14003Deep Optimal Experimental Design for Parameter Estimation Problems2024-06-21T00:00:00+00:00Md Shahriar Rahim SiddiquiArman RahmimEldad HaberOptimal experimental design is a well studied field in applied science and engineering. Techniques for estimating such a design are commonly used within the framework of parameter estimation. 
Nonetheless, in recent years parameter estimation techniques are changing rapidly with the introduction of deep learning techniques to replace traditional estimation methods. This in turn requires the adaptation of optimal experimental design that is associated with these new techniques. In this paper we investigate a new experimental design methodology that uses deep learning. We show that the training of a network as a Likelihood Free Estimator can be used to significantly simplify the design process and circumvent the need for the computationally expensive bi-level optimization problem that is inherent in optimal experimental design for non-linear systems. Furthermore, deep design improves the quality of the recovery process for parameter estimation problems. As proof of concept we apply our methodology to two different systems of Ordinary Differential Equations.https://papers.cool/arxiv/2406.14009Confidence Intervals and Simultaneous Confidence Bands Based on Deep Learning2024-06-21T00:00:00+00:00Asaf Ben ArieMalka GorfineDeep learning models have significantly improved prediction accuracy in various fields, gaining recognition across numerous disciplines. Yet, an aspect of deep learning that remains insufficiently addressed is the assessment of prediction uncertainty. Producing reliable uncertainty estimators could be crucial in practical terms. For instance, predictions associated with a high degree of uncertainty could be sent for further evaluation. Recent works in uncertainty quantification of deep learning predictions, including Bayesian posterior credible intervals and a frequentist confidence-interval estimation, have proven to yield either invalid or overly conservative intervals. Furthermore, there is currently no method for quantifying uncertainty that can accommodate deep neural networks for survival (time-to-event) data that involves right-censored outcomes. 
In this work, we provide a valid non-parametric bootstrap method that correctly disentangles data uncertainty from the noise inherent in the adopted optimization algorithm, ensuring that the resulting point-wise confidence intervals or the simultaneous confidence bands are accurate (i.e., valid and not overly conservative). The proposed ad-hoc method can be easily integrated into any deep neural network without interfering with the training process. The utility of the proposed approach is illustrated by constructing simultaneous confidence bands for survival curves derived from deep neural networks for survival data with right censoring.https://papers.cool/arxiv/2406.14033Ensembles of Probabilistic Regression Trees2024-06-21T00:00:00+00:00Alexandre SeillerÉric GaussierEmilie DevijverMarianne ClauselSami AlkhouryTree-based ensemble methods such as random forests, gradient-boosted trees, and Bayesian additive regression trees have been successfully used for regression problems in many applications and research studies. In this paper, we study ensemble versions of probabilistic regression trees that provide smooth approximations of the objective function by assigning each observation to each region with respect to a probability distribution. We prove that the ensemble versions of probabilistic regression trees considered are consistent, and experimentally study their bias-variance trade-off and compare them with the state-of-the-art in terms of performance prediction.https://papers.cool/arxiv/2406.14040A Practical Diffusion Path for Sampling2024-06-21T00:00:00+00:00Omar ChehabAnna KorbaDiffusion models are state-of-the-art methods in generative modeling when samples from a target probability distribution are available, and can be efficiently sampled, using score matching to estimate score vectors guiding a Langevin process. However, in the setting where samples from the target are not available, e.g. 
when this target's density is known up to a normalization constant, the score estimation task is challenging. Previous approaches rely on Monte Carlo estimators that are either computationally heavy to implement or sample-inefficient. In this work, we propose a computationally attractive alternative, relying on the so-called dilation path, that yields score vectors that are available in closed form. This path interpolates between a Dirac and the target distribution using a convolution. We propose a simple implementation of Langevin dynamics guided by the dilation path, using adaptive step-sizes. We illustrate the results of our sampling method on a range of tasks, and show that it performs better than classical alternatives.https://papers.cool/arxiv/2406.14071Bayesian Bandit Algorithms with Approximate Inference in Stochastic Linear Bandits2024-06-21T00:00:00+00:00Ziyi HuangHenry LamHaofeng ZhangBayesian bandit algorithms with approximate Bayesian inference have been widely used in real-world applications. Nevertheless, their theoretical justification is less investigated in the literature, especially for contextual bandit problems. To fill this gap, we propose a general theoretical framework to analyze stochastic linear bandits in the presence of approximate inference and conduct regret analysis on two Bayesian bandit algorithms, Linear Thompson sampling (LinTS) and the extension of Bayesian Upper Confidence Bound, namely Linear Bayesian Upper Confidence Bound (LinBUCB). We demonstrate that both LinTS and LinBUCB can preserve their original rates of regret upper bound but with a sacrifice of larger constant terms when applied with approximate inference. These results hold for general Bayesian inference approaches, under the assumption that the inference error measured by two different $\alpha$-divergences is bounded. 
Additionally, by introducing a new definition of well-behaved distributions, we show that LinBUCB improves the regret rate of LinTS from $\tilde{O}(d^{3/2}\sqrt{T})$ to $\tilde{O}(d\sqrt{T})$, matching the minimax optimal rate. To our knowledge, this work provides the first regret bounds in the setting of stochastic linear bandits with bounded approximate inference errors.https://papers.cool/arxiv/2406.14140Nonparametric Jackknife Instrumental Variable Estimation and Confounding Robust Surrogate Indices2024-06-21T00:00:00+00:00Aurélien BibautNathan KallusApoorva LalJackknife instrumental variable estimation (JIVE) is a classic method to leverage many weak instrumental variables (IVs) to estimate linear structural models, overcoming the bias of standard methods like two-stage least squares. In this paper, we extend the jackknife approach to nonparametric IV (NPIV) models with many weak IVs. Since NPIV characterizes the structural regression as having residuals projected onto the IV being zero, existing approaches minimize an estimate of the average squared projected residuals, but their estimates are biased under many weak IVs. We introduce an IV splitting device inspired by JIVE to remove this bias, and by carefully studying this split-IV empirical process we establish learning rates that depend on generic complexity measures of the nonparametric hypothesis class. We then turn to leveraging this for semiparametric inference on average treatment effects (ATEs) on unobserved long-term outcomes predicted from short-term surrogates, using historical experiments as IVs to learn this nonparametric predictive relationship even in the presence of confounding between short- and long-term observations. Using split-IV estimates of a debiasing nuisance, we develop asymptotically normal estimates for predicted ATEs, enabling inference.https://papers.cool/arxiv/2406.14145Temperature in the Iberian Peninsula: Trend, seasonality, and heterogeneity2024-06-21T00:00:00+00:00C. 
Vladimir Rodríguez-CaballeroEsther RuizIn this paper, we propose fitting unobserved component models to represent the dynamic evolution of bivariate systems of centre and log-range temperatures obtained monthly from minimum/maximum temperatures observed at a given location. In doing so, the centre and log-range temperature are decomposed into potentially stochastic trends, seasonal, and transitory components. Since our model encompasses deterministic trends and seasonal components as limiting cases, we contribute to the debate on whether stochastic or deterministic components better represent the trend and seasonal components. The methodology is applied to centre and log-range temperature observed in four locations in the Iberian Peninsula, namely, Barcelona, Coruña, Madrid, and Seville. We show that, at each location, the centre temperature can be represented by a smooth integrated random walk with time-varying slope, while a stochastic level better represents the log-range. We also show that centre and log-range temperature are unrelated. The methodology is then extended to simultaneously model centre and log-range temperature observed at several locations in the Iberian Peninsula. We fit a multi-level dynamic factor model to extract potential commonalities among centre (log-range) temperature while also allowing for heterogeneity in different areas in the Iberian Peninsula. We show that, although the commonality in trends of average temperature is considerable, the regional components are also relevant.https://papers.cool/arxiv/2406.14159Enhancing multivariate post-processed visibility predictions utilizing CAMS forecasts2024-06-21T00:00:00+00:00Mária LakatosSándor BaranIn our contemporary era, meteorological weather forecasts increasingly incorporate ensemble predictions of visibility - a parameter of great importance in aviation, maritime navigation, and air quality assessment, with direct implications for public health. 
However, this weather variable falls short of the predictive accuracy achieved for other quantities issued by meteorological centers. Therefore, statistical post-processing is recommended to enhance the reliability and accuracy of predictions. By estimating the predictive distributions of the variables with the aid of historical observations and forecasts, one can achieve statistical consistency between true observations and ensemble predictions. Visibility observations, following the recommendation of the World Meteorological Organization, are typically reported in discrete values; hence, the predictive distribution of the weather quantity takes the form of a discrete parametric law. Recent studies demonstrated that the application of classification algorithms can successfully improve the skill of such discrete forecasts; however, a frequently emerging issue is that certain spatial and/or temporal dependencies could be lost between marginals. Based on visibility ensemble forecasts of the European Centre for Medium-Range Weather Forecasts for 30 locations in Central Europe, we investigate whether the inclusion of Copernicus Atmosphere Monitoring Service (CAMS) predictions of the same weather quantity as an additional covariate could enhance the skill of the post-processing methods and whether it contributes to the successful integration of spatial dependence between marginals. Our study confirms that post-processed forecasts are substantially superior to raw and climatological predictions, and the utilization of CAMS forecasts provides a further significant enhancement both in the univariate and multivariate setup.https://papers.cool/arxiv/2406.14182Averaging polyhazard models using Piecewise deterministic Monte Carlo with applications to data with long-term survivors2024-06-21T00:00:00+00:00Luke HardcastleSamuel LivingstoneGianluca BaioPolyhazard models are a class of flexible parametric models for modelling survival over extended time horizons. 
Their additive hazard structure allows for flexible, non-proportional hazards whose characteristics can change over time while retaining a parametric form, which allows for survival to be extrapolated beyond the observation period of a study. Significant user input is required, however, in selecting the number of latent hazards to model, their distributions and the choice of which variables to associate with each hazard. The resulting set of models is too large to explore manually, limiting their practical usefulness. Motivated by applications to stroke survivor and kidney transplant patient survival times we extend the standard polyhazard model through a prior structure allowing for joint inference of parameters and structural quantities, and develop a sampling scheme that utilises state-of-the-art Piecewise Deterministic Markov Processes to sample from the resulting transdimensional posterior with minimal user tuning.https://papers.cool/arxiv/2406.14184On integral priors for multiple comparison in Bayesian model selection2024-06-21T00:00:00+00:00Diego SalmerónJuan Antonio CanoChristian P. RobertNoninformative priors constructed for estimation purposes are usually not appropriate for model selection and testing. The methodology of integral priors was developed to get prior distributions for Bayesian model selection when comparing two models, modifying initial improper reference priors. We propose a generalization of this methodology to more than two models. Our approach adds an artificial copy of each model under comparison by compactifying the parametric space and creating an ergodic Markov chain across all models that returns the integral priors as marginals of the stationary distribution. 
Besides the guarantee of their existence and the lack of paradoxes attached to estimation reference priors, an additional advantage of this methodology is that the simulation of this Markov chain is straightforward as it only requires simulations of imaginary training samples for all models and from the corresponding posterior distributions. This renders its implementation automatic and generic, both in the nested case and in the nonnested case.https://papers.cool/arxiv/2406.14269Concentration of a sparse Bayesian model with Horseshoe prior in estimating high-dimensional precision matrix2024-06-21T00:00:00+00:00The Tien MaiPrecision matrices are crucial in many fields such as social networks, neuroscience, and economics, representing the edge structure of Gaussian graphical models (GGMs), where a zero in an off-diagonal position of the precision matrix indicates conditional independence between nodes. In high-dimensional settings where the dimension of the precision matrix $p$ exceeds the sample size $n$ and the matrix is sparse, methods like graphical Lasso, graphical SCAD, and CLIME are popular for estimating GGMs. While frequentist methods are well-studied, Bayesian approaches for (unstructured) sparse precision matrices are less explored. The graphical horseshoe estimate by Li et al. (2019), applying the global-local horseshoe prior, shows superior empirical performance, but theoretical work for sparse precision matrix estimations using shrinkage priors is limited. This paper addresses these gaps by providing concentration results for the tempered posterior with the fully specified horseshoe prior in high-dimensional settings. Moreover, we also provide novel theoretical results for model misspecification, offering a general oracle inequality for the posterior.https://papers.cool/arxiv/2406.14292Proximal Interacting Particle Langevin Algorithms2024-06-21T00:00:00+00:00Paula Cordero EncinarFrancesca R. CrucinioO. 
Deniz AkyildizWe introduce a class of algorithms, termed Proximal Interacting Particle Langevin Algorithms (PIPLA), for inference and learning in latent variable models whose joint probability density is non-differentiable. Leveraging proximal Markov chain Monte Carlo (MCMC) techniques and the recently introduced interacting particle Langevin algorithm (IPLA), we propose several variants within the novel proximal IPLA family, tailored to the problem of estimating parameters in a non-differentiable statistical model. We prove nonasymptotic bounds for the parameter estimates produced by multiple algorithms in the strongly log-concave setting and provide comprehensive numerical experiments on various models to demonstrate the effectiveness of the proposed methods. In particular, we demonstrate the utility of the proposed family of algorithms on a toy hierarchical example where our assumptions can be checked, as well as on the problems of sparse Bayesian logistic regression, sparse Bayesian neural network, and sparse matrix completion. Our theory and experiments together show that the PIPLA family can be the de facto choice for parameter estimation problems in latent variable models for non-differentiable models.https://papers.cool/arxiv/2406.14302Identifiable Exchangeable Mechanisms for Causal Structure and Representation Learning2024-06-21T00:00:00+00:00Patrik ReizingerSiyuan GuoFerenc HuszárBernhard SchölkopfWieland BrendelIdentifying latent representations or causal structures is important for good generalization and downstream task performance. However, both fields have been developed rather independently. We observe that several methods in both representation and causal structure learning rely on the same data-generating process (DGP), namely, exchangeable but not i.i.d. (independent and identically distributed) data. 
We provide a unified framework, termed Identifiable Exchangeable Mechanisms (IEM), for representation and structure learning under the lens of exchangeability. IEM provides new insights that let us relax the necessary conditions for causal structure identification in exchangeable non-i.i.d. data. We also demonstrate the existence of a duality condition in identifiable representation learning, leading to new identifiability results. We hope this work will pave the way for further research in causal representation learning.https://papers.cool/arxiv/2406.14426Transferable Boltzmann Generators2024-06-21T00:00:00+00:00Leon KleinFrank NoéThe generation of equilibrium samples of molecular systems has been a long-standing problem in statistical physics. Boltzmann Generators are a generative machine learning method that addresses this issue by learning a transformation via a normalizing flow from a simple prior distribution to the target Boltzmann distribution of interest. Recently, flow matching has been employed to train Boltzmann Generators for small molecular systems in Cartesian coordinates. We extend this work and propose a first framework for Boltzmann Generators that are transferable across chemical space, such that they predict zero-shot Boltzmann distributions for test molecules without being retrained for these systems. These transferable Boltzmann Generators allow approximate sampling from the target distribution of unseen systems, as well as efficient reweighting to the target Boltzmann distribution. The transferability of the proposed framework is evaluated on dipeptides, where we show that it generalizes efficiently to unseen systems. 
Furthermore, we demonstrate that our proposed architecture enhances the efficiency of Boltzmann Generators trained on single molecular systems.https://papers.cool/arxiv/2406.14451Gradient Estimation via Differentiable Metropolis-Hastings2024-06-21T00:00:00+00:00Gaurav AryaMoritz SchauerRuben SeyerMetropolis-Hastings estimates intractable expectations - can differentiating the algorithm estimate their gradients? The challenge is that Metropolis-Hastings trajectories are not conventionally differentiable due to the discrete accept/reject steps. Using a technique based on recoupling chains, our method differentiates through the Metropolis-Hastings sampler itself, allowing us to estimate gradients with respect to a parameter of otherwise intractable expectations. Our main contribution is a proof of strong consistency and a central limit theorem for our estimator under assumptions that hold in common Bayesian inference problems. The proofs augment the sampler chain with latent information, and formulate the estimator as a stopping tail functional of this augmented chain. We demonstrate our method on examples of Bayesian sensitivity analysis and optimizing a random walk Metropolis proposal.https://papers.cool/arxiv/2406.14453The Effective Number of Parameters in Kernel Density Estimation2024-06-21T00:00:00+00:00Sofia GuglielminiIgor VolobouevAlexandre TrindadeThe quest for a formula that satisfactorily measures the effective degrees of freedom in kernel density estimation (KDE) is a long-standing problem with few solutions. Starting from the orthogonal polynomial sequence (OPS) expansion for the ratio of the empirical to the oracle density, we show how convolution with the kernel leads to a new OPS with respect to which one may express the resulting KDE. 
The expansion coefficients of the two OPS systems can then be related via a kernel sensitivity matrix, and this then naturally leads to a definition of effective parameters by taking the trace of a symmetrized positive semi-definite normalized version. The resulting effective degrees of freedom (EDoF) formula is an oracle-based quantity, the first ever proposed in the literature. Asymptotic properties of the empirical EDoF are worked out through influence functions. Numerical investigations confirm the theoretical insights.https://papers.cool/arxiv/2406.14535On estimation and order selection for multivariate extremes via clustering2024-06-21T00:00:00+00:00Shiyuan DengHe TangShuyang BaiWe investigate the estimation of multivariate extreme models with a discrete spectral measure using spherical clustering techniques. The primary contribution involves devising a method for selecting the order, that is, the number of clusters. The method consistently identifies the true order, i.e., the number of spectral atoms, and enjoys intuitive implementation in practice. Specifically, we introduce an extra penalty term to the well-known simplified average silhouette width, which penalizes small cluster sizes and small dissimilarities between cluster centers. Consequently, we provide a consistent method for determining the order of a max-linear factor model, where a typical information-based approach is not viable. Our second contribution is a large-deviation-type analysis for estimating the discrete spectral measure through clustering methods, which serves as an assessment of the convergence quality of clustering-based estimation for multivariate extremes. Additionally, as a third contribution, we discuss how estimating the discrete measure can lead to parameter estimations of heavy-tailed factor models. We also present simulations and real-data studies that demonstrate order selection and factor model estimation.