2024-09-18 | | Total: 14
We study the problem of system identification for stochastic continuous-time dynamics, based on a single finite-length state trajectory. We present a method for estimating the possibly unstable open-loop matrix by employing properly randomized control inputs. Then, we establish theoretical performance guarantees showing that the estimation error decays with trajectory length, a measure of excitability, and the signal-to-noise ratio, while it grows with dimension. Numerical illustrations that showcase the rates of learning the dynamics, will be provided as well. To perform the theoretical analysis, we develop new technical tools that are of independent interest. That includes non-asymptotic stochastic bounds for highly non-stationary martingales and generalized laws of iterated logarithms, among others.
The growing demand for accurate, efficient, and scalable solutions in computational mechanics highlights the need for advanced operator learning algorithms that can efficiently handle large datasets while providing reliable uncertainty quantification. This paper introduces a novel Gaussian Process (GP) based neural operator for solving parametric differential equations. The approach proposed leverages the expressive capability of deterministic neural operators and the uncertainty awareness of conventional GP. In particular, we propose a ``neural operator-embedded kernel'' wherein the GP kernel is formulated in the latent space learned using a neural operator. Further, we exploit a stochastic dual descent (SDD) algorithm for simultaneously training the neural operator parameters and the GP hyperparameters. Our approach addresses the (a) resolution dependence and (b) cubic complexity of traditional GP models, allowing for input-resolution independence and scalability in high-dimensional and non-linear parametric systems, such as those encountered in computational mechanics. We apply our method to a range of non-linear parametric partial differential equations (PDEs) and demonstrate its superiority in both computational efficiency and accuracy compared to standard GP models and wavelet neural operators. Our experimental results highlight the efficacy of this framework in solving complex PDEs while maintaining robustness in uncertainty estimation, positioning it as a scalable and reliable operator-learning algorithm for computational mechanics.
Numerical simulation is powerful to study nonlinear solid mechanics problems. However, mesh-based or particle-based numerical methods suffer from the common shortcoming of being time-consuming, particularly for complex problems with real-time analysis requirements. This study presents a clustering adaptive Gaussian process regression (CAG) method aiming for real-time prediction for nonlinear structural responses in solid mechanics. It is a data-driven machine learning method featuring a small sample size, high accuracy, and high efficiency, leveraging nonlinear structural response patterns. Similar to the traditional Gaussian process regression (GPR) method, it operates in offline and online stages. In the offline stage, an adaptive sample generation technique is introduced to cluster datasets into distinct patterns for demand-driven sample allocation. This ensures comprehensive coverage of the critical samples for the solution space of interest. In the online stage, following the divide-and-conquer strategy, a pre-prediction classification categorizes problems into predefined patterns sequentially predicted by the trained multi-pattern Gaussian process regressor. In addition, dimension reduction and restoration techniques are employed in the proposed method to enhance its efficiency. A set of problems involving material, geometric, and boundary condition nonlinearities is presented to demonstrate the CAG method's abilities. The proposed method can offer predictions within a second and attain high precision with only about 20 samples within the context of this study, outperforming the traditional GPR using uniformly distributed samples for error reductions ranging from 1 to 3 orders of magnitude. The CAG method is expected to offer a powerful tool for real-time prediction of nonlinear solid mechanical problems and shed light on the complex nonlinear structural response pattern.
We propose a general approach for encouraging fairness in survival analysis models based on minimizing a worst-case error across all subpopulations that occur with at least a user-specified probability. This approach can be used to convert many existing survival analysis models into ones that simultaneously encourage fairness, without requiring the user to specify which attributes or features to treat as sensitive in the training loss function. From a technical standpoint, our approach applies recent developments of distributionally robust optimization (DRO) to survival analysis. The complication is that existing DRO theory uses a training loss function that decomposes across contributions of individual data points, i.e., any term that shows up in the loss function depends only on a single training point. This decomposition does not hold for commonly used survival loss functions, including for the Cox proportional hazards model, its deep neural network variants, and many other recently developed models that use loss functions involving ranking or similarity score calculations. We address this technical hurdle using a sample splitting strategy. We demonstrate our sample splitting DRO approach by using it to create fair versions of a diverse set of existing survival analysis models including the Cox model (and its deep variant DeepSurv), the discrete-time model DeepHit, and the neural ODE model SODEN. We also establish a finite-sample theoretical guarantee to show what our sample splitting DRO loss converges to. For the Cox model, we further derive an exact DRO approach that does not use sample splitting. For all the models that we convert into DRO variants, we show that the DRO variants often score better on recently established fairness metrics (without incurring a significant drop in accuracy) compared to existing survival analysis fairness regularization techniques.
Background: Advanced methods for causal inference, such as targeted maximum likelihood estimation (TMLE), require certain conditions for statistical inference. However, in situations where there is not differentiability due to data sparsity or near-positivity violations, the Donsker class condition is violated. In such situations, TMLE variance can suffer from inflation of the type I error and poor coverage, leading to conservative confidence intervals. Cross-validation of the TMLE algorithm (CVTMLE) has been suggested to improve on performance compared to TMLE in settings of positivity or Donsker class violations. We aim to investigate the performance of CVTMLE compared to TMLE in various settings. Methods: We utilised the data-generating mechanism as described in Leger et al. (2022) to run a Monte Carlo experiment under different Donsker class violations. Then, we evaluated the respective statistical performances of TMLE and CVTMLE with different super learner libraries, with and without regression tree methods. Results: We found that CVTMLE vastly improves confidence interval coverage without adversely affecting bias, particularly in settings with small sample sizes and near-positivity violations. Furthermore, incorporating regression trees using standard TMLE with ensemble super learner-based initial estimates increases bias and variance leading to invalid statistical inference. Conclusions: It has been shown that when using CVTMLE the Donsker class condition is no longer necessary to obtain valid statistical inference when using regression trees and under either data sparsity or near-positivity violations. We show through simulations that CVTMLE is much less sensitive to the choice of the super learner library and thereby provides better estimation and inference in cases where the super learner library uses more flexible candidates and is prone to overfitting.
This paper considers a finite sample perspective on the problem of identifying an LTI system from a finite set of possible systems using trajectory data. To this end, we use the maximum likelihood estimator to identify the true system and provide an upper bound for its sample complexity. Crucially, the derived bound does not rely on a potentially restrictive stability assumption. Additionally, we leverage tools from information theory to provide a lower bound to the sample complexity that holds independently of the used estimator. The derived sample complexity bounds are analyzed analytically and numerically.
We study supervised classification for datasets with a very large number of input variables. The naïve Bayes classifier is attractive for its simplicity, scalability and effectiveness in many real data applications. When the strong naïve Bayes assumption of conditional independence of the input variables given the target variable is not valid, variable selection and model averaging are two common ways to improve the performance. In the case of the naïve Bayes classifier, the resulting weighting scheme on the models reduces to a weighting scheme on the variables. Here we focus on direct estimation of variable weights in such a weighted naïve Bayes classifier. We propose a sparse regularization of the model log-likelihood, which takes into account prior penalization costs related to each input variable. Compared to averaging based classifiers used up until now, our main goal is to obtain parsimonious robust models with less variables and equivalent performance. The direct estimation of the variable weights amounts to a non-convex optimization problem for which we propose and compare several two-stage algorithms. First, the criterion obtained by convex relaxation is minimized using several variants of standard gradient methods. Then, the initial non-convex optimization problem is solved using local optimization methods initialized with the result of the first stage. The various proposed algorithms result in optimization-based weighted naïve Bayes classifiers, that are evaluated on benchmark datasets and positioned w.r.t. to a reference averaging-based classifier.
Modelling longitudinal data is an important yet challenging task. These datasets can be high-dimensional, contain non-linear effects and time-varying covariates. Gaussian process (GP) prior-based variational autoencoders (VAEs) have emerged as a promising approach due to their ability to model time-series data. However, they are costly to train and struggle to fully exploit the rich covariates characteristic of longitudinal data, making them difficult for practitioners to use effectively. In this work, we leverage linear mixed models (LMMs) and amortized variational inference to provide conditional priors for VAEs, and propose LMM-VAE, a scalable, interpretable and identifiable model. We highlight theoretical connections between it and GP-based techniques, providing a unified framework for this class of methods. Our proposal performs competitively compared to existing approaches across simulated and real-world datasets.
Calibration ensures that predicted uncertainties align with observed uncertainties. While there is an extensive literature on recalibration methods for univariate probabilistic forecasts, work on calibration for multivariate forecasts is much more limited. This paper introduces a novel post-hoc recalibration approach that addresses multivariate calibration for potentially misspecified models. Our method involves constructing local mappings between vectors of marginal probability integral transform values and the space of observations, providing a flexible and model free solution applicable to continuous, discrete, and mixed responses. We present two versions of our approach: one uses K-nearest neighbors, and the other uses normalizing flows. Each method has its own strengths in different situations. We demonstrate the effectiveness of our approach on two real data applications: recalibrating a deep neural network's currency exchange rate forecast and improving a regression model for childhood malnutrition in India for which the multivariate response has both discrete and continuous components.
In this paper, we provide tight lower bounds for the oracle complexity of minimizing high-order Hölder smooth and uniformly convex functions. Specifically, for a function whose $p^{th}$-order derivatives are Hölder continuous with degree $\nu$ and parameter $H$, and that is uniformly convex with degree $q$ and parameter $\sigma$, we focus on two asymmetric cases: (1) $q > p + \nu$, and (2) $q < p+\nu$. Given up to $p^{th}$-order oracle access, we establish worst-case oracle complexities of $\Omega\left( \left( \frac{H}{\sigma}\right)^\frac{2}{3(p+\nu)-2}\left( \frac{\sigma}{\epsilon}\right)^\frac{2(q-p-\nu)}{q(3(p+\nu)-2)}\right)$ with a truncated-Gaussian smoothed hard function in the first case and $\Omega\left(\left(\frac{H}{\sigma}\right)^\frac{2}{3(p+\nu)-2}+ \log^2\left(\frac{\sigma^{p+\nu}}{H^q}\right)^\frac{1}{p+\nu-q}\right)$ in the second case, for reaching an $\epsilon$-approximate solution in terms of the optimality gap. Our analysis generalizes previous lower bounds for functions under first- and second-order smoothness as well as those for uniformly convex functions, and furthermore our results match the corresponding upper bounds in the general setting.
Motivated by the sensitivity-based importance score of the adaptive low-rank adaptation (AdaLoRA), we utilize more theoretically supported metrics, including the signal-to-noise ratio (SNR), along with the Improved Variational Online Newton (IVON) optimizer, for adaptive parameter budget allocation. The resulting Bayesian counterpart not only has matched or surpassed the performance of using the sensitivity-based importance metric but is also a faster alternative to AdaLoRA with Adam. Our theoretical analysis reveals a significant connection between the two metrics, providing a Bayesian perspective on the efficacy of sensitivity as an importance score. Furthermore, our findings suggest that the magnitude, rather than the variance, is the primary indicator of the importance of parameters.
Artificial intelligence models have shown great potential in structure-based drug design, generating ligands with high binding affinities. However, existing models have often overlooked a crucial physical constraint: atoms must maintain a minimum pairwise distance to avoid separation violation, a phenomenon governed by the balance of attractive and repulsive forces. To mitigate such separation violations, we propose NucleusDiff. It models the interactions between atomic nuclei and their surrounding electron clouds by enforcing the distance constraint between the nuclei and manifolds. We quantitatively evaluate NucleusDiff using the CrossDocked2020 dataset and a COVID-19 therapeutic target, demonstrating that NucleusDiff reduces violation rate by up to 100.00% and enhances binding affinity by up to 22.16%, surpassing state-of-the-art models for structure-based drug design. We also provide qualitative analysis through manifold sampling, visually confirming the effectiveness of NucleusDiff in reducing separation violations and improving binding affinities.
The advent of foundation models (FMs) such as large language models (LLMs) has led to a cultural shift in data science, both in medicine and beyond. This shift involves moving away from specialized predictive models trained for specific, well-defined domain questions to generalist FMs pre-trained on vast amounts of unstructured data, which can then be adapted to various clinical tasks and questions. As a result, the standard data science workflow in medicine has been fundamentally altered; the foundation model lifecycle (FMLC) now includes distinct upstream and downstream processes, in which computational resources, model and data access, and decision-making power are distributed among multiple stakeholders. At their core, FMs are fundamentally statistical models, and this new workflow challenges the principles of Veridical Data Science (VDS), hindering the rigorous statistical analysis expected in transparent and scientifically reproducible data science practices. We critically examine the medical FMLC in light of the core principles of VDS: predictability, computability, and stability (PCS), and explain how it deviates from the standard data science workflow. Finally, we propose recommendations for a reimagined medical FMLC that expands and refines the PCS principles for VDS including considering the computational and accessibility constraints inherent to FMs.
In-context learning (ICL) is a cornerstone of large language model (LLM) functionality, yet its theoretical foundations remain elusive due to the complexity of transformer architectures. In particular, most existing work only theoretically explains how the attention mechanism facilitates ICL under certain data models. It remains unclear how the other building blocks of the transformer contribute to ICL. To address this question, we study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data, where each token in the Markov chain statistically depends on the previous $n$ tokens. We analyze a sophisticated transformer model featuring relative positional embedding, multi-head softmax attention, and a feed-forward layer with normalization. We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model that performs a generalized version of the induction head mechanism with a learned feature, resulting from the congruous contribution of all the building blocks. In the limiting model, the first attention layer acts as a $\mathit{copier}$, copying past tokens within a given window to each position, and the feed-forward network with normalization acts as a $\mathit{selector}$ that generates a feature vector by only looking at informationally relevant parents from the window. Finally, the second attention layer is a $\mathit{classifier}$ that compares these features with the feature at the output position, and uses the resulting similarity scores to generate the desired output. Our theory is further validated by experiments.