https://papers.cool/arxiv/stat.MLMachine Learning2024-06-14T00:00:00+00:00python-feedgenCool Papers - Immersive Paper Discoveryhttps://papers.cool/arxiv/2406.04535Tangent differential privacy2024-06-14T00:00:00+00:00Lexing YingDifferential privacy is a framework for protecting the identity of individual data points in the decision-making process. In this note, we propose a new form of differential privacy called tangent differential privacy. Compared with the usual differential privacy that is defined uniformly across data distributions, tangent differential privacy is tailored towards a specific data distribution of interest. It also allows for general distribution distances such as total variation distance and Wasserstein distance. In the case of risk minimization, we show that entropic regularization guarantees tangent differential privacy under rather general conditions on the risk function.https://papers.cool/arxiv/2406.08516Enhanced Anomaly Detection in Automotive Systems Using SAAD: Statistical Aggregated Anomaly Detection2024-06-14T00:00:00+00:00Dacian GoinaEduard HogeaGeorge MatiesThis paper presents a novel anomaly detection methodology termed Statistical Aggregated Anomaly Detection (SAAD). The SAAD approach integrates advanced statistical techniques with machine learning, and its efficacy is demonstrated through validation on real sensor data from a Hardware-in-the-Loop (HIL) environment within the automotive domain. The key innovation of SAAD lies in its ability to significantly enhance the accuracy and robustness of anomaly detection when combined with Fully Connected Networks (FCNs) augmented by dropout layers. Comprehensive experimental evaluations indicate that the standalone statistical method achieves an accuracy of 72.1%, whereas the deep learning model alone attains an accuracy of 71.5%. In contrast, the aggregated method achieves a superior accuracy of 88.3% and an F1 score of 0.921, thereby outperforming the individual models. These results underscore the effectiveness of SAAD, demonstrating its potential for broad application in various domains, including automotive systems.https://papers.cool/arxiv/2406.08569Noise-Aware Differentially Private Regression via Meta-Learning2024-06-14T00:00:00+00:00Ossi RäisäStratis MarkouMatthew AshmanWessel P. BruinsmaMarlon TobabenAntti HonkelaRichard E. TurnerMany high-stakes applications require machine learning models that protect user privacy and provide well-calibrated, accurate predictions. While Differential Privacy (DP) is the gold standard for protecting user privacy, standard DP mechanisms typically significantly impair performance. One approach to mitigating this issue is pre-training models on simulated data before DP learning on the private data. In this work we go a step further, using simulated data to train a meta-learning model that combines the Convolutional Conditional Neural Process (ConvCNP) with an improved functional DP mechanism of Hall et al. [2013] yielding the DPConvCNP. DPConvCNP learns from simulated data how to map private data to a DP predictive model in one forward pass, and then provides accurate, well-calibrated predictions. We compare DPConvCNP with a DP Gaussian Process (GP) baseline with carefully tuned hyperparameters. The DPConvCNP outperforms the GP baseline, especially on non-Gaussian data, yet is much faster at test time and requires less tuning.https://papers.cool/arxiv/2406.08666Interventional Causal Discovery in a Mixture of DAGs2024-06-14T00:00:00+00:00Burak VarıcıDmitriy Katz-RogozhnikovDennis WeiPrasanna SattigeriAli TajerCausal interactions among a group of variables are often modeled by a single causal graph. In some domains, however, these interactions are best described by multiple co-existing causal graphs, e.g., in dynamical systems or genomics. This paper addresses the hitherto unknown role of interventions in learning causal interactions among variables governed by a mixture of causal systems, each modeled by one directed acyclic graph (DAG). Causal discovery from mixtures is fundamentally more challenging than single-DAG causal discovery. Two major difficulties stem from (i) inherent uncertainty about the skeletons of the component DAGs that constitute the mixture and (ii) possibly cyclic relationships across these component DAGs. This paper addresses these challenges and aims to identify edges that exist in at least one component DAG of the mixture, referred to as true edges. First, it establishes matching necessary and sufficient conditions on the size of interventions required to identify the true edges. Next, guided by the necessity results, an adaptive algorithm is designed that learns all true edges using ${\cal O}(n^2)$ interventions, where $n$ is the number of nodes. Remarkably, the size of the interventions is optimal if the underlying mixture model does not contain cycles across its components. More generally, the gap between the intervention size used by the algorithm and the optimal size is quantified. It is shown to be bounded by the cyclic complexity number of the mixture model, defined as the size of the minimal intervention that can break the cycles in the mixture, which is upper bounded by the number of cycles among the ancestors of a node.https://papers.cool/arxiv/2406.08748Learning in Feature Spaces via Coupled Covariances: Asymmetric Kernel SVD and Nyström method2024-06-14T00:00:00+00:00Qinghua TaoFrancesco ToninAlex LambertYingyi ChenPanagiotis PatrinosJohan A. K. SuykensIn contrast with Mercer kernel-based approaches as used e.g., in Kernel Principal Component Analysis (KPCA), it was previously shown that Singular Value Decomposition (SVD) inherently relates to asymmetric kernels and Asymmetric Kernel Singular Value Decomposition (KSVD) has been proposed. However, the existing formulation to KSVD cannot work with infinite-dimensional feature mappings, the variational objective can be unbounded, and needs further numerical evaluation and exploration towards machine learning. In this work, i) we introduce a new asymmetric learning paradigm based on coupled covariance eigenproblem (CCE) through covariance operators, allowing infinite-dimensional feature maps. The solution to CCE is ultimately obtained from the SVD of the induced asymmetric kernel matrix, providing links to KSVD. ii) Starting from the integral equations corresponding to a pair of coupled adjoint eigenfunctions, we formalize the asymmetric Nystr\"om method through a finite sample approximation to speed up training. iii) We provide the first empirical evaluations verifying the practical utility and benefits of KSVD and compare with methods resorting to symmetrization or linear SVD across multiple tasks.https://papers.cool/arxiv/2406.08776Learning Joint and Individual Structure in Network Data with Covariates2024-06-14T00:00:00+00:00Carson JamesDongbang YuanIrina GaynanovaJesús ArroyoDatasets consisting of a network and covariates associated with its vertices have become ubiquitous. One problem pertaining to this type of data is to identify information unique to the network, information unique to the vertex covariates and information that is shared between the network and the vertex covariates. Existing techniques for network data and vertex covariates focus on capturing structure that is shared but are usually not able to differentiate structure that is unique to each dataset. This work formulates a low-rank model that simultaneously captures joint and individual information in network data with vertex covariates. A two-step estimation procedure is proposed, composed of an efficient spectral method followed by a refinement optimization step. Theoretically, we show that the spectral method is able to consistently recover the joint and individual components under a general signal-plus-noise model. Simulations and real data examples demonstrate the ability of the methods to recover accurate and interpretable components. In particular, the application of the methodology to a food trade network between countries with economic, developmental and geographical country-level indicators as covariates yields joint and individual factors that explain the trading patterns.https://papers.cool/arxiv/2406.08819AIM: Attributing, Interpreting, Mitigating Data Unfairness2024-06-14T00:00:00+00:00Zhining LiuRuizhong QiuZhichen ZengYada ZhuHendrik HamannHanghang TongData collected in the real world often encapsulates historical discrimination against disadvantaged groups and individuals. Existing fair machine learning (FairML) research has predominantly focused on mitigating discriminative bias in the model prediction, with far less effort dedicated towards exploring how to trace biases present in the data, despite its importance for the transparency and interpretability of FairML. To fill this gap, we investigate a novel research problem: discovering samples that reflect biases/prejudices from the training data. Grounding on the existing fairness notions, we lay out a sample bias criterion and propose practical algorithms for measuring and countering sample bias. The derived bias score provides intuitive sample-level attribution and explanation of historical bias in data. On this basis, we further design two FairML strategies via sample-bias-informed minimal data editing. They can mitigate both group and individual unfairness at the cost of minimal or zero predictive utility loss. Extensive experiments and analyses on multiple real-world datasets demonstrate the effectiveness of our methods in explaining and mitigating unfairness. Code is available at https://github.com/ZhiningLiu1998/AIM.https://papers.cool/arxiv/2406.08884The Penalized Inverse Probability Measure for Conformal Classification2024-06-14T00:00:00+00:00Paul MelkiLionel BombrunBoubacar DialloJérôme DiasJean-Pierre da CostaThe deployment of safe and trustworthy machine learning systems, and particularly complex black box neural networks, in real-world applications requires reliable and certified guarantees on their performance. The conformal prediction framework offers such formal guarantees by transforming any point into a set predictor with valid, finite-set, guarantees on the coverage of the true at a chosen level of confidence. Central to this methodology is the notion of the nonconformity score function that assigns to each example a measure of ''strangeness'' in comparison with the previously seen observations. While the coverage guarantees are maintained regardless of the nonconformity measure, the point predictor and the dataset, previous research has shown that the performance of a conformal model, as measured by its efficiency (the average size of the predicted sets) and its informativeness (the proportion of prediction sets that are singletons), is influenced by the choice of the nonconformity score function. The current work introduces the Penalized Inverse Probability (PIP) nonconformity score, and its regularized version RePIP, that allow the joint optimization of both efficiency and informativeness. Through toy examples and empirical results on the task of crop and weed image classification in agricultural robotics, the current work shows how PIP-based conformal classifiers exhibit precisely the desired behavior in comparison with other nonconformity measures and strike a good balance between informativeness and efficiency.https://papers.cool/arxiv/2406.08918Beyond the Calibration Point: Mechanism Comparison in Differential Privacy2024-06-14T00:00:00+00:00Georgios KaissisStefan KolekBorja BalleJamie HayesDaniel RueckertIn differentially private (DP) machine learning, the privacy guarantees of DP mechanisms are often reported and compared on the basis of a single $(\varepsilon, \delta)$-pair. This practice overlooks that DP guarantees can vary substantially \emph{even between mechanisms sharing a given $(\varepsilon, \delta)$}, and potentially introduces privacy vulnerabilities which can remain undetected. This motivates the need for robust, rigorous methods for comparing DP guarantees in such cases. Here, we introduce the $\Delta$-divergence between mechanisms which quantifies the worst-case excess privacy vulnerability of choosing one mechanism over another in terms of $(\varepsilon, \delta)$, $f$-DP and in terms of a newly presented Bayesian interpretation. Moreover, as a generalisation of the Blackwell theorem, it is endowed with strong decision-theoretic foundations. Through application examples, we show that our techniques can facilitate informed decision-making and reveal gaps in the current understanding of privacy risks, as current practices in DP-SGD often result in choosing mechanisms with high excess privacy vulnerabilities.https://papers.cool/arxiv/2406.08929Step-by-Step Diffusion: An Elementary Tutorial2024-06-14T00:00:00+00:00Preetum NakkiranArwen BradleyHattie ZhouMadhu AdvaniWe present an accessible first course on diffusion models and flow matching for machine learning, aimed at a technical audience with no diffusion experience. We try to simplify the mathematical details as much as possible (sometimes heuristically), while retaining enough precision to derive correct algorithms.https://papers.cool/arxiv/2406.09023Schur's Positive-Definite Network: Deep Learning in the SPD cone with structure2024-06-14T00:00:00+00:00Can PouliquenMathurin MassiasTitouan VayerEstimating matrices in the symmetric positive-definite (SPD) cone is of interest for many applications ranging from computer vision to graph learning. While there exist various convex optimization-based estimators, they remain limited in expressivity due to their model-based approach. The success of deep learning has thus led many to use neural networks to learn to estimate SPD matrices in a data-driven fashion. For learning structured outputs, one promising strategy involves architectures designed by unrolling iterative algorithms, which potentially benefit from inductive bias properties. However, designing correct unrolled architectures for SPD learning is difficult: they either do not guarantee that their output has all the desired properties, rely on heavy computations, or are overly restrained to specific matrices which hinders their expressivity. In this paper, we propose a novel and generic learning module with guaranteed SPD outputs called SpodNet, that also enables learning a larger class of functions than existing approaches. Notably, it solves the challenging task of learning jointly SPD and sparse matrices. Our experiments demonstrate the versatility of SpodNet layers.https://papers.cool/arxiv/2406.09049Efficiently Deciding Algebraic Equivalence of Bow-Free Acyclic Path Diagrams2024-06-14T00:00:00+00:00Thijs van OmmenFor causal discovery in the presence of latent confounders, constraints beyond conditional independences exist that can enable causal discovery algorithms to distinguish more pairs of graphs. Such constraints are not well-understood yet. In the setting of linear structural equation models without bows, we study algebraic constraints and argue that these provide the most fine-grained resolution achievable. We propose efficient algorithms that decide whether two graphs impose the same algebraic constraints, or whether the constraints imposed by one graph are a subset of those imposed by another graph.https://papers.cool/arxiv/2406.09069On the Robustness of Global Feature Effect Explanations2024-06-14T00:00:00+00:00Hubert BanieckiGiuseppe CasalicchioBernd BischlPrzemyslaw BiecekWe study the robustness of global post-hoc explanations for predictive models trained on tabular data. Effects of predictor features in black-box supervised learning are an essential diagnostic tool for model debugging and scientific discovery in applied sciences. However, how vulnerable they are to data and model perturbations remains an open research question. We introduce several theoretical bounds for evaluating the robustness of partial dependence plots and accumulated local effects. Our experimental results with synthetic and real-world datasets quantify the gap between the best and worst-case scenarios of (mis)interpreting machine learning predictions globally.https://papers.cool/arxiv/2406.09116Injective Flows for parametric hypersurfaces2024-06-14T00:00:00+00:00Marcello Massimo NegriJonathan AellenVolker RothNormalizing Flows (NFs) are powerful and efficient models for density estimation. When modeling densities on manifolds, NFs can be generalized to injective flows but the Jacobian determinant becomes computationally prohibitive. Current approaches either consider bounds on the log-likelihood or rely on some approximations of the Jacobian determinant. In contrast, we propose injective flows for parametric hypersurfaces and show that for such manifolds we can compute the Jacobian determinant exactly and efficiently, with the same cost as NFs. Furthermore, we show that for the subclass of star-like manifolds we can extend the proposed framework to always allow for a Cartesian representation of the density. We showcase the relevance of modeling densities on hypersurfaces in two settings. Firstly, we introduce a novel Objective Bayesian approach to penalized likelihood models by interpreting level-sets of the penalty as star-like manifolds. Secondly, we consider Bayesian mixture models and introduce a general method for variational inference by defining the posterior of mixture weights on the probability simplex.https://papers.cool/arxiv/2406.09241What is the long-run distribution of stochastic gradient descent? A large deviations analysis2024-06-14T00:00:00+00:00Waïss AzizianFranck IutzelerJérôme MalickPanayotis MertikopoulosIn this paper, we examine the long-run distribution of stochastic gradient descent (SGD) in general, non-convex problems. Specifically, we seek to understand which regions of the problem's state space are more likely to be visited by SGD, and by how much. Using an approach based on the theory of large deviations and randomly perturbed dynamical systems, we show that the long-run distribution of SGD resembles the Boltzmann-Gibbs distribution of equilibrium thermodynamics with temperature equal to the method's step-size and energy levels determined by the problem's objective and the statistics of the noise. In particular, we show that, in the long run, (a) the problem's critical region is visited exponentially more often than any non-critical region; (b) the iterates of SGD are exponentially concentrated around the problem's minimum energy state (which does not always coincide with the global minimum of the objective); (c) all other connected components of critical points are visited with frequency that is exponentially proportional to their energy level; and, finally (d) any component of local maximizers or saddle points is "dominated" by a component of local minimizers which is visited exponentially more often.https://papers.cool/arxiv/2406.09307A tutorial on fairness in machine learning in healthcare2024-06-14T00:00:00+00:00Jianhui GaoBenson ChouZachary R. McCawHilary ThurstonPaul VargheseChuan HongJessica GronsbellOBJECTIVE: Ensuring that machine learning (ML) algorithms are safe and effective within all patient groups, and do not disadvantage particular patients, is essential to clinical decision making and preventing the reinforcement of existing healthcare inequities. The objective of this tutorial is to introduce the medical informatics community to the common notions of fairness within ML, focusing on clinical applications and implementation in practice. TARGET AUDIENCE: As gaps in fairness arise in a variety of healthcare applications, this tutorial is designed to provide an understanding of fairness, without assuming prior knowledge, to researchers and clinicians who make use of modern clinical data. SCOPE: We describe the fundamental concepts and methods used to define fairness in ML, including an overview of why models in healthcare may be unfair, a summary and comparison of the metrics used to quantify fairness, and a discussion of some ongoing research. We illustrate some of the fairness methods introduced through a case study of mortality prediction in a publicly available electronic health record dataset. Finally, we provide a user-friendly R package for comprehensive group fairness evaluation, enabling researchers and clinicians to assess fairness in their own ML work.https://papers.cool/arxiv/2406.09347Separations in the Representational Capabilities of Transformers and Recurrent Architectures2024-06-14T00:00:00+00:00Satwik BhattamishraMichael HahnPhil BlunsomVarun KanadeTransformer architectures have been widely adopted in foundation models. Due to their high inference costs, there is renewed interest in exploring the potential of efficient recurrent architectures (RNNs). In this paper, we analyze the differences in the representational capabilities of Transformers and RNNs across several tasks of practical relevance, including index lookup, nearest neighbor, recognizing bounded Dyck languages, and string equality. For the tasks considered, our results show separations based on the size of the model required for different architectures. For example, we show that a one-layer Transformer of logarithmic width can perform index lookup, whereas an RNN requires a hidden state of linear size. Conversely, while constant-size RNNs can recognize bounded Dyck languages, we show that one-layer Transformers require a linear size for this task. Furthermore, we show that two-layer Transformers of logarithmic size can perform decision tasks such as string equality or disjointness, whereas both one-layer Transformers and recurrent models require linear size for these tasks. We also show that a log-size two-layer Transformer can implement the nearest neighbor algorithm in its forward pass; on the other hand recurrent models require linear size. Our constructions are based on the existence of $N$ nearly orthogonal vectors in $O(\log N)$ dimensional space and our lower bounds are based on reductions from communication complexity problems. We supplement our theoretical results with experiments that highlight the differences in the performance of these architectures on practical-size sequences.https://papers.cool/arxiv/2406.09357Advancing Graph Generation through Beta Diffusion2024-06-14T00:00:00+00:00Yilin HeXinyang LiuBo ChenMingyuan ZhouDiffusion models have demonstrated effectiveness in generating natural images and have been extended to generate diverse data types, including graphs. This new generation of diffusion-based graph generative models has demonstrated significant performance improvements over methods that rely on variational autoencoders or generative adversarial networks. It's important to recognize, however, that most of these models employ Gaussian or categorical diffusion processes, which can struggle with sparse and long-tailed data distributions. In our work, we introduce Graph Beta Diffusion (GBD), a diffusion-based generative model particularly adept at capturing diverse graph structures. GBD utilizes a beta diffusion process, tailored for the sparse and range-bounded characteristics of graph adjacency matrices. Furthermore, we have developed a modulation technique that enhances the realism of the generated graphs by stabilizing the generation of critical graph structures, while preserving flexibility elsewhere. The outstanding performance of GBD across three general graph benchmarks and two biochemical graph benchmarks highlights its capability to effectively capture the complexities of real-world graph data. The code will be made available at https://github.com/YH-UtMSB/Graph_Beta_Diffusionhttps://papers.cool/arxiv/2406.09405Why Warmup the Learning Rate? Underlying Mechanisms and Improvements2024-06-14T00:00:00+00:00Dayal Singh KalraMaissam BarkeshliIt is common in deep learning to warm up the learning rate $\eta$, often by a linear schedule between $\eta_{\text{init}} = 0$ and a predetermined target $\eta_{\text{trgt}}$. In this paper, we show through systematic experiments using SGD and Adam that the overwhelming benefit of warmup arises from allowing the network to tolerate larger $\eta_{\text{trgt}}$ by forcing the network to more well-conditioned areas of the loss landscape. The ability to handle larger $\eta_{\text{trgt}}$ makes hyperparameter tuning more robust while improving the final performance. We uncover different regimes of operation during the warmup period, depending on whether training starts off in a progressive sharpening or sharpness reduction phase, which in turn depends on the initialization and parameterization. Using these insights, we show how $\eta_{\text{init}}$ can be properly chosen by utilizing the loss catapult mechanism, which saves on the number of warmup steps, in some cases completely eliminating the need for warmup. We also suggest an initialization for the variance in Adam which provides benefits similar to warmup.https://papers.cool/arxiv/2406.08654Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization2024-06-14T00:00:00+00:00Yuhang CaiJingfeng WuSong MeiMichael LindseyPeter L. BartlettThe typical training of neural networks using large stepsize gradient descent (GD) under the logistic loss often involves two distinct phases, where the empirical risk oscillates in the first phase but decreases monotonically in the second phase. We investigate this phenomenon in two-layer networks that satisfy a near-homogeneity condition. We show that the second phase begins once the empirical risk falls below a certain threshold, dependent on the stepsize. Additionally, we show that the normalized margin grows nearly monotonically in the second phase, demonstrating an implicit bias of GD in training non-homogeneous predictors. If the dataset is linearly separable and the derivative of the activation function is bounded away from zero, we show that the average empirical risk decreases, implying that the first phase must stop in finite steps. Finally, we demonstrate that by choosing a suitably large stepsize, GD that undergoes this phase transition is more efficient than GD that monotonically decreases the risk. Our analysis applies to networks of any width, beyond the well-known neural tangent kernel and mean-field regimes.https://papers.cool/arxiv/2406.08658Pruning is Optimal for Learning Sparse Features in High-Dimensions2024-06-14T00:00:00+00:00Nuri Mert VuralMurat A. ErdogduWhile it is commonly observed in practice that pruning networks to a certain level of sparsity can improve the quality of the features, a theoretical explanation of this phenomenon remains elusive. In this work, we investigate this by demonstrating that a broad class of statistical models can be optimally learned using pruned neural networks trained with gradient descent, in high-dimensions. We consider learning both single-index and multi-index models of the form $y = \sigma^*(\boldsymbol{V}^{\top} \boldsymbol{x}) + \epsilon$, where $\sigma^*$ is a degree-$p$ polynomial, and $\boldsymbol{V} \in \mathbbm{R}^{d \times r}$ with $r \ll d$, is the matrix containing relevant model directions. We assume that $\boldsymbol{V}$ satisfies a certain $\ell_q$-sparsity condition for matrices and show that pruning neural networks proportional to the sparsity level of $\boldsymbol{V}$ improves their sample complexity compared to unpruned networks. Furthermore, we establish Correlational Statistical Query (CSQ) lower bounds in this setting, which take the sparsity level of $\boldsymbol{V}$ into account. We show that if the sparsity level of $\boldsymbol{V}$ exceeds a certain threshold, training pruned networks with a gradient descent algorithm achieves the sample complexity suggested by the CSQ lower bound. In the same scenario, however, our results imply that basis-independent methods such as models trained via standard gradient descent initialized with rotationally invariant random weights can provably achieve only suboptimal sample complexity.https://papers.cool/arxiv/2406.08697Orthogonalized Estimation of Difference of $Q$-functions2024-06-14T00:00:00+00:00Angela ZhouOffline reinforcement learning is important in many settings with available observational data but the inability to deploy new policies online due to safety, cost, and other concerns. Many recent advances in causal inference and machine learning target estimation of causal contrast functions such as CATE, which is sufficient for optimizing decisions and can adapt to potentially smoother structure. We develop a dynamic generalization of the R-learner (Nie and Wager 2021, Lewis and Syrgkanis 2021) for estimating and optimizing the difference of $Q^\pi$-functions, $Q^\pi(s,1)-Q^\pi(s,0)$ (which can be used to optimize multiple-valued actions). We leverage orthogonal estimation to improve convergence rates in the presence of slower nuisance estimation rates and prove consistency of policy optimization under a margin condition. The method can leverage black-box nuisance estimators of the $Q$-function and behavior policy to target estimation of a more structured $Q$-function contrast.https://papers.cool/arxiv/2406.08853Assessment of Uncertainty Quantification in Universal Differential Equations2024-06-14T00:00:00+00:00Nina SchmidDavid Fernandes del PozoWillem WaegemanJan HasenauerScientific Machine Learning is a new class of approaches that integrate physical knowledge and mechanistic models with data-driven techniques for uncovering governing equations of complex processes. Among the available approaches, Universal Differential Equations (UDEs) are used to combine prior knowledge in the form of mechanistic formulations with universal function approximators, like neural networks. Integral to the efficacy of UDEs is the joint estimation of parameters within mechanistic formulations and the universal function approximators using empirical data. The robustness and applicability of resultant models, however, hinge upon the rigorous quantification of uncertainties associated with these parameters, as well as the predictive capabilities of the overall model or its constituent components. With this work, we provide a formalisation of uncertainty quantification (UQ) for UDEs and investigate important frequentist and Bayesian methods. By analysing three synthetic examples of varying complexity, we evaluate the validity and efficiency of ensembles, variational inference and Markov chain Monte Carlo sampling as epistemic UQ methods for UDEs.https://papers.cool/arxiv/2406.09048Central Limit Theorem for Bayesian Neural Network trained with Variational Inference2024-06-14T00:00:00+00:00Arnaud DescoursTom HuixArnaud GuillinManon MichelÉric MoulinesBoris NectouxIn this paper, we rigorously derive Central Limit Theorems (CLT) for Bayesian two-layerneural networks in the infinite-width limit and trained by variational inference on a regression task. The different networks are trained via different maximization schemes of the regularized evidence lower bound: (i) the idealized case with exact estimation of a multiple Gaussian integral from the reparametrization trick, (ii) a minibatch scheme using Monte Carlo sampling, commonly known as Bayes-by-Backprop, and (iii) a computationally cheaper algorithm named Minimal VI. The latter was recently introduced by leveraging the information obtained at the level of the mean-field limit. Laws of large numbers are already rigorously proven for the three schemes that admits the same asymptotic limit. By deriving CLT, this work shows that the idealized and Bayes-by-Backprop schemes have similar fluctuation behavior, that is different from the Minimal VI one. Numerical experiments then illustrate that the Minimal VI scheme is still more efficient, in spite of bigger variances, thanks to its important gain in computational complexity.https://papers.cool/arxiv/2406.09051Bayesian Structural Model Updating with Multimodal Variational Autoencoder2024-06-14T00:00:00+00:00Tatsuya ItoiKazuho AmishikiSangwon LeeTaro YaoyamaThis paper presents a novel framework for Bayesian structural model updating and proposes a method that utilizes the surrogate unimodal encoders of a multimodal variational autoencoder. This method facilitates an efficient nonparametric estimation of the likelihood describing the observed data. It is particularly suitable for high-dimensional correlated simultaneous observations applicable to various dynamic analysis models. The proposed approach is benchmarked using a numerical model of a single-story frame building with acceleration and dynamic strain measurements.https://papers.cool/arxiv/2406.09084Operator-informed score matching for Markov diffusion models2024-06-14T00:00:00+00:00Zheyang ShenChris J. OatesDiffusion models are typically trained using score matching, yet score matching is agnostic to the particular forward process that defines the model. This paper argues that Markov diffusion models enjoy an advantage over other types of diffusion model, as their associated operators can be exploited to improve the training process. In particular, (i) there exists an explicit formal solution to the forward process as a sequence of time-dependent kernel mean embeddings; and (ii) the derivation of score-matching and related estimators can be streamlined. Building upon (i), we propose Riemannian diffusion kernel smoothing, which ameliorates the need for neural score approximation, at least in the low-dimensional context; Building upon (ii), we propose operator-informed score matching, a variance reduction technique that is straightforward to implement in both low- and high-dimensional diffusion modeling and is demonstrated to improve score matching in an empirical proof-of-concept.https://papers.cool/arxiv/2406.09172Generative vs. Discriminative modeling under the lens of uncertainty quantification2024-06-14T00:00:00+00:00Elouan Argouarc'hFrançois DesbouvriesEric BaratEiji KawasakiLearning a parametric model from a given dataset indeed enables to capture intrinsic dependencies between random variables via a parametric conditional probability distribution and in turn predict the value of a label variable given observed variables. In this paper, we undertake a comparative analysis of generative and discriminative approaches which differ in their construction and the structure of the underlying inference problem. Our objective is to compare the ability of both approaches to leverage information from various sources in an epistemic uncertainty aware inference via the posterior predictive distribution. We assess the role of a prior distribution, explicit in the generative case and implicit in the discriminative case, leading to a discussion about discriminative models suffering from imbalanced dataset. We next examine the double role played by the observed variables in the generative case, and discuss the compatibility of both approaches with semi-supervised learning. We also provide with practical insights and we examine how the modeling choice impacts the sampling from the posterior predictive distribution. With regard to this, we propose a general sampling scheme enabling supervised learning for both approaches, as well as semi-supervised learning when compatible with the considered modeling approach. Throughout this paper, we illustrate our arguments and conclusions using the example of affine regression, and validate our comparative analysis through classification simulations using neural network based models.https://papers.cool/arxiv/2406.09177Scalable and Flexible Causal Discovery with an Efficient Test for Adjacency2024-06-14T00:00:00+00:00Alan Nawzad AminAndrew Gordon WilsonTo make accurate predictions, understand mechanisms, and design interventions in systems of many variables, we wish to learn causal graphs from large scale data. Unfortunately the space of all possible causal graphs is enormous so scalably and accurately searching for the best fit to the data is a challenge. In principle we could substantially decrease the search space, or learn the graph entirely, by testing the conditional independence of variables. However, deciding if two variables are adjacent in a causal graph may require an exponential number of tests. Here we build a scalable and flexible method to evaluate if two variables are adjacent in a causal graph, the Differentiable Adjacency Test (DAT). DAT replaces an exponential number of tests with a provably equivalent relaxed problem. It then solves this problem by training two neural networks. We build a graph learning method based on DAT, DAT-Graph, that can also learn from data with interventions. DAT-Graph can learn graphs of 1000 variables with state of the art accuracy. Using the graph learned by DAT-Graph, we also build models that make much more accurate predictions of the effects of interventions on large scale RNA sequencing data.https://papers.cool/arxiv/2406.09183Ridge interpolators in correlated factor regression models -- exact risk analysis2024-06-14T00:00:00+00:00Mihailo StojnicWe consider correlated \emph{factor} regression models (FRM) and analyze the performance of classical ridge interpolators. Utilizing powerful \emph{Random Duality Theory} (RDT) mathematical engine, we obtain \emph{precise} closed form characterizations of the underlying optimization problems and all associated optimizing quantities. In particular, we provide \emph{excess prediction risk} characterizations that clearly show the dependence on all key model parameters, covariance matrices, loadings, and dimensions. As a function of the over-parametrization ratio, the generalized least squares (GLS) risk also exhibits the well known \emph{double-descent} (non-monotonic) behavior. Similarly to the classical linear regression models (LRM), we demonstrate that such FRM phenomenon can be smoothened out by the optimally tuned ridge regularization. The theoretical results are supplemented by numerical simulations and an excellent agrement between the two is observed. Moreover, we note that ``ridge smootenhing'' is often of limited effect already for over-parametrization ratios above $5$ and of virtually no effect for those above $10$. This solidifies the notion that one of the recently most popular neural networks paradigms -- \emph{zero-training (interpolating) generalizes well} -- enjoys wider applicability, including the one within the FRM estimation/prediction context.https://papers.cool/arxiv/2406.09194Bengining overfitting in Fixed Dimension via Physics-Informed Learning with Smooth Iductive Bias2024-06-14T00:00:00+00:00Honam WongWendao WuFanghui LiuYiping LuRecent advances in machine learning theory showed that interpolation to noisy samples using over-parameterized machine learning algorithms always leads to inconsistency. However, this work surprisingly discovers that interpolated machine learning can exhibit benign overfitting and consistency when using physics-informed learning for supervised tasks governed by partial differential equations (PDEs) describing laws of physics. An analysis provides an asymptotic Sobolev norm learning curve for kernel ridge(less) regression addressing linear inverse problems involving elliptic PDEs. The results reveal that the PDE operators can stabilize variance and lead to benign overfitting for fixed-dimensional problems, contrasting standard regression settings. The impact of various inductive biases introduced by minimizing different Sobolev norms as implicit regularization is also examined. Notably, the convergence rate is independent of the specific (smooth) inductive bias for both ridge and ridgeless regression. For regularized least squares estimators, all (smooth enough) inductive biases can achieve optimal convergence rates when the regularization parameter is properly chosen. The smoothness requirement recovers a condition previously found in the Bayesian setting and extends conclusions to minimum norm interpolation estimators.https://papers.cool/arxiv/2406.09199Precise analysis of ridge interpolators under heavy correlations -- a Random Duality Theory view2024-06-14T00:00:00+00:00Mihailo StojnicWe consider fully row/column-correlated linear regression models and study several classical estimators (including minimum norm interpolators (GLS), ordinary least squares (LS), and ridge regressors). We show that \emph{Random Duality Theory} (RDT) can be utilized to obtain precise closed form characterizations of all estimators related optimizing quantities of interest, including the \emph{prediction risk} (testing or generalization error). On a qualitative level out results recover the risk's well known non-monotonic (so-called double-descent) behavior as the number of features/sample size ratio increases. On a quantitative level, our closed form results show how the risk explicitly depends on all key model parameters, including the problem dimensions and covariance matrices. Moreover, a special case of our results, obtained when intra-sample (or time-series) correlations are not present, precisely match the corresponding ones obtained via spectral methods in [6,16,17,24].https://papers.cool/arxiv/2406.09253Deep Sketched Output Kernel Regression for Structured Prediction2024-06-14T00:00:00+00:00Tamim El AhmadJunjie YangPierre LaforgueFlorence d'Alché-BucBy leveraging the kernel trick in the output space, kernel-induced losses provide a principled way to define structured output prediction tasks for a wide variety of output modalities. In particular, they have been successfully used in the context of surrogate non-parametric regression, where the kernel trick is typically exploited in the input space as well. However, when inputs are images or texts, more expressive models such as deep neural networks seem more suited than non-parametric methods. In this work, we tackle the question of how to train neural networks to solve structured output prediction tasks, while still benefiting from the versatility and relevance of kernel-induced losses. We design a novel family of deep neural architectures, whose last layer predicts in a data-dependent finite-dimensional subspace of the infinite-dimensional output feature space deriving from the kernel-induced loss. This subspace is chosen as the span of the eigenfunctions of a randomly-approximated version of the empirical kernel covariance operator. Interestingly, this approach unlocks the use of gradient descent algorithms (and consequently of any neural architecture) for structured prediction. Experiments on synthetic tasks as well as real-world supervised graph prediction problems show the relevance of our method.https://papers.cool/arxiv/2406.09375Learning conditional distributions on continuous spaces2024-06-14T00:00:00+00:00Cyril BénézetZiteng ChengSebastian JaimungalWe investigate sample-based learning of conditional distributions on multi-dimensional unit boxes, allowing for different dimensions of the feature and target spaces. Our approach involves clustering data near varying query points in the feature space to create empirical measures in the target space. We employ two distinct clustering schemes: one based on a fixed-radius ball and the other on nearest neighbors. We establish upper bounds for the convergence rates of both methods and, from these bounds, deduce optimal configurations for the radius and the number of neighbors. We propose to incorporate the nearest neighbors method into neural network training, as our empirical analysis indicates it has better performance in practice. For efficiency, our training process utilizes approximate nearest neighbors search with random binary space partitioning. Additionally, we employ the Sinkhorn algorithm and a sparsity-enforced transport plan. Our empirical findings demonstrate that, with a suitably designed structure, the neural network has the ability to adapt to a suitable level of Lipschitz continuity locally. For reproducibility, our code is available at \url{https://github.com/zcheng-a/LCD_kNN}.https://papers.cool/arxiv/2406.09387Oblivious subspace embeddings for compressed Tucker decompositions2024-06-14T00:00:00+00:00Matthew PietrosanuBei JiangLinglong KongEmphasis in the tensor literature on random embeddings (tools for low-distortion dimension reduction) for the canonical polyadic (CP) tensor decomposition has left analogous results for the more expressive Tucker decomposition comparatively lacking. This work establishes general Johnson-Lindenstrauss (JL) type guarantees for the estimation of Tucker decompositions when an oblivious random embedding is applied along each mode. When these embeddings are drawn from a JL-optimal family, the decomposition can be estimated within $\varepsilon$ relative error under restrictions on the embedding dimension that are in line with recent CP results. We implement a higher-order orthogonal iteration (HOOI) decomposition algorithm with random embeddings to demonstrate the practical benefits of this approach and its potential to improve the accessibility of otherwise prohibitive tensor analyses. On moderately large face image and fMRI neuroimaging datasets, empirical results show that substantial dimension reduction is possible with minimal increase in reconstruction error relative to traditional HOOI ($\leq$5% larger error, 50%-60% lower computation time for large models with 50% dimension reduction along each mode). Especially for large tensors, our method outperforms traditional higher-order singular value decomposition (HOSVD) and recently proposed TensorSketch methods.