2025-03-28 | | Total: 57
We used contrastive neural networks to learn useful similarity scores between the 144 cartridge casings in the NBIDE dataset, under the common-but-unknown source paradigm. The common-but-unknown source problem is a problem archetype in forensics where the question is whether two objects share a common source (e.g. were two cartridge casings fired from the same firearm). Similarity scores are often used to interpret evidence under this paradigm. We directly compared our results to a state-of-the-art algorithm, Congruent Matching Cells (CMC). When trained on the E3 dataset of 2967 cartridge casings, contrastive learning achieved an ROC AUC of 0.892. The CMC algorithm achieved 0.867. We also conducted an ablation study where we varied the neural network architecture; specifically, the network's width or depth. The ablation study showed that contrastive network performance results are somewhat robust to the network architecture. This work was in part motivated by the use of similarity scores attained via contrastive learning for standard evidence interpretation methods such as score-based likelihood ratios.
We develop new uncertainty propagation methods for feed-forward neural network architectures with leaky ReLu activation functions subject to random perturbations in the input vectors. In particular, we derive analytical expressions for the probability density function (PDF) of the neural network output and its statistical moments as a function of the input uncertainty and the parameters of the network, i.e., weights and biases. A key finding is that an appropriate linearization of the leaky ReLu activation function yields accurate statistical results even for large perturbations in the input vectors. This can be attributed to the way information propagates through the network. We also propose new analytically tractable Gaussian copula surrogate models to approximate the full joint PDF of the neural network output. To validate our theorical results, we conduct Monte Carlo simulations and a thorough error analysis on a multi-layer neural network representing a nonlinear integro-differential operator between two polynomial function spaces. Our findings demonstrate excellent agreement between the theoretical predictions and Monte Carlo simulations.
We propose a framework to convert $(\varepsilon, \delta)$-approximate Differential Privacy (DP) mechanisms into $(\varepsilon, 0)$-pure DP mechanisms, a process we call ``purification''. This algorithmic technique leverages randomized post-processing with calibrated noise to eliminate the $\delta$ parameter while preserving utility. By combining the tighter utility bounds and computational efficiency of approximate DP mechanisms with the stronger guarantees of pure DP, our approach achieves the best of both worlds. We illustrate the applicability of this framework in various settings, including Differentially Private Empirical Risk Minimization (DP-ERM), data-dependent DP mechanisms such as Propose-Test-Release (PTR), and query release tasks. To the best of our knowledge, this is the first work to provide a systematic method for transforming approximate DP into pure DP while maintaining competitive accuracy and computational efficiency.
Accelerating the discovery and manufacturing of advanced materials with specific properties is a critical yet formidable challenge due to vast search space, high costs of experiments, and time-intensive nature of material characterization. In recent years, active learning, where a surrogate machine learning (ML) model mimics the scientific discovery process of a human scientist, has emerged as a promising approach to address these challenges by guiding experimentation toward high-value outcomes with a limited budget. Among the diverse active learning philosophies, the concept of surprise (capturing the divergence between expected and observed outcomes) has demonstrated significant potential to drive experimental trials and refine predictive models. Scientific discovery often stems from surprise thereby making it a natural driver to guide the search process. Despite its promise, prior studies leveraging surprise metrics such as Shannon and Bayesian surprise lack mechanisms to account for prior confidence, leading to excessive exploration of uncertain regions that may not yield useful information. To address this, we propose the Confidence-Adjusted Surprise Measure for Active Resourceful Trials (CA-SMART), a novel Bayesian active learning framework tailored for optimizing data-driven experimentation. On a high level, CA-SMART incorporates Confidence-Adjusted Surprise (CAS) to dynamically balance exploration and exploitation by amplifying surprises in regions where the model is more certain while discounting them in highly uncertain areas. We evaluated CA-SMART on two benchmark functions (Six-Hump Camelback and Griewank) and in predicting the fatigue strength of steel. The results demonstrate superior accuracy and efficiency compared to traditional surprise metrics, standard Bayesian Optimization (BO) acquisition functions and conventional ML methods.
Evaluation is critical to advance decision making across domains, yet existing methodologies often struggle to balance theoretical rigor and practical scalability. In order to reduce the cost of experimental evaluation, we introduce a computational theory of evaluation for parameterisable subjects. We prove upper bounds of generalized evaluation error and generalized causal effect error of evaluation metric on subject. We also prove efficiency, and consistency to estimated causal effect of subject on metric by prediction. To optimize evaluation models, we propose a meta-learner to handle heterogeneous evaluation subjects space. Comparing with other computational approaches, our (conditional) evaluation model reduced 24.1%-99.0% evaluation errors across 12 scenes, including individual medicine, scientific simulation, business activities, and quantum trade. The evaluation time is reduced 3-7 order of magnitude comparing with experiments or simulations.
Context: Astronomy and astrophysics demand rigorous handling of uncertainties to ensure the credibility of outcomes. The growing integration of artificial intelligence offers a novel avenue to address this necessity. This convergence presents an opportunity to create advanced models capable of quantifying diverse sources of uncertainty and automating complex data relationship exploration. What: We introduce a hierarchical Bayesian architecture whose probabilistic relationships are modeled by neural networks, designed to forecast stellar attributes such as mass, radius, and age (our main target). This architecture handles both observational uncertainties stemming from measurements and epistemic uncertainties inherent in the predictive model itself. As a result, our system generates distributions that encapsulate the potential range of values for our predictions, providing a comprehensive understanding of their variability and robustness. Methods: Our focus is on dating main sequence stars using a technique known as Chemical Clocks, which serves as both our primary astronomical challenge and a model prototype. In this work, we use hierarchical architectures to account for correlations between stellar parameters and optimize information extraction from our dataset. We also employ Bayesian neural networks for their versatility and flexibility in capturing complex data relationships. Results: By integrating our machine learning algorithm into a Bayesian framework, we have successfully propagated errors consistently and managed uncertainty treatment effectively, resulting in predictions characterized by broader uncertainty margins. This approach facilitates more conservative estimates in stellar dating. Our architecture achieves age predictions with a mean absolute error of less than 1 Ga for the stars in the test dataset.
Designing efficient learning algorithms with complexity guarantees for Markov decision processes (MDPs) with large or continuous state and action spaces remains a fundamental challenge. We address this challenge for entropy-regularized MDPs with Polish state and action spaces, assuming access to a generative model of the environment. We propose a novel family of multilevel Monte Carlo (MLMC) algorithms that integrate fixed-point iteration with MLMC techniques and a generic stochastic approximation of the Bellman operator. We quantify the precise impact of the chosen approximate Bellman operator on the accuracy of the resulting MLMC estimator. Leveraging this error analysis, we show that using a biased plain MC estimate for the Bellman operator results in quasi-polynomial sample complexity, whereas an unbiased randomized multilevel approximation of the Bellman operator achieves polynomial sample complexity in expectation. Notably, these complexity bounds are independent of the dimensions or cardinalities of the state and action spaces, distinguishing our approach from existing algorithms whose complexities scale with the sizes of these spaces. We validate these theoretical performance guarantees through numerical experiments.
Accurate patient mortality prediction enables effective risk stratification, leading to personalized treatment plans and improved patient outcomes. However, predicting mortality in healthcare remains a significant challenge, with existing studies often focusing on specific diseases or limited predictor sets. This study evaluates machine learning models for all-cause in-hospital mortality prediction using the MIMIC-III database, employing a comprehensive feature engineering approach. Guided by clinical expertise and literature, we extracted key features such as vital signs (e.g., heart rate, blood pressure), laboratory results (e.g., creatinine, glucose), and demographic information. The Random Forest model achieved the highest performance with an AUC of 0.94, significantly outperforming other machine learning and deep learning approaches. This demonstrates Random Forest's robustness in handling high-dimensional, noisy clinical data and its potential for developing effective clinical decision support tools. Our findings highlight the importance of careful feature engineering for accurate mortality prediction. We conclude by discussing implications for clinical adoption and propose future directions, including enhancing model robustness and tailoring prediction models for specific diseases.
This work deals with the generation of theoretical correlation matrices with specific sparsity patterns, associated to graph structures. We present a novel approach based on convex optimization, offering greater flexibility compared to existing techniques, notably by controlling the mean of the entry distribution in the generated correlation matrices. This allows for the generation of correlation matrices that better represent realistic data and can be used to benchmark statistical methods for graph inference.
Oceanic processes at fine scales are crucial yet difficult to observe accurately due to limitations in satellite and in-situ measurements. The Surface Water and Ocean Topography (SWOT) mission provides high-resolution Sea Surface Height (SSH) data, though noise patterns often obscure fine scale structures. Current methods struggle with noisy data or require extensive supervised training, limiting their effectiveness on real-world observations. We introduce SIMPGEN (Simulation-Informed Metric and Prior for Generative Ensemble Networks), an unsupervised adversarial learning framework combining real SWOT observations with simulated reference data. SIMPGEN leverages wavelet-informed neural metrics to distinguish noisy from clean fields, guiding realistic SSH reconstructions. Applied to SWOT data, SIMPGEN effectively removes noise, preserving fine-scale features better than existing neural methods. This robust, unsupervised approach not only improves SWOT SSH data interpretation but also demonstrates strong potential for broader oceanographic applications, including data assimilation and super-resolution.
Many Monte Carlo (MC) and importance sampling (IS) methods use mixture models (MMs) for their simplicity and ability to capture multimodal distributions. Recently, subtractive mixture models (SMMs), i.e. MMs with negative coefficients, have shown greater expressiveness and success in generative modeling. However, their negative parameters complicate sampling, requiring costly auto-regressive techniques or accept-reject algorithms that do not scale in high dimensions. In this work, we use the difference representation of SMMs to construct an unbiased IS estimator ($\Delta\text{Ex}$) that removes the need to sample from the SMM, enabling high-dimensional expectation estimation with SMMs. In our experiments, we show that $\Delta\text{Ex}$ can achieve comparable estimation quality to auto-regressive sampling while being considerably faster in MC estimation. Moreover, we conduct initial experiments with $\Delta\text{Ex}$ using hand-crafted proposals, gaining first insights into how to construct safe proposals for $\Delta\text{Ex}$.
Large language models afford opportunities for using computers for intensive tasks, realizing research opportunities that have not been considered before. One such opportunity could be a systematic interrogation of the scientific literature. Here, we show how a large language model can be used to construct a literature review of 2699 publications associated with microphysics parametrizations in the Weather and Research Forecasting (WRF) model, with the goal of learning how they were used and their systematic biases, when simulating precipitation. The database was constructed of publications identified from Web of Science and Scopus searches. The large language model GPT-4 Turbo was used to extract information about model configurations and performance from the text of 2699 publications. Our results reveal the landscape of how nine of the most popular microphysics parameterizations have been used around the world: Lin, Ferrier, WRF Single-Moment, Goddard Cumulus Ensemble, Morrison, Thompson, and WRF Double-Moment. More studies used one-moment parameterizations before 2020 and two-moment parameterizations after 2020. Seven out of nine parameterizations tended to overestimate precipitation. However, systematic biases of parameterizations differed in various regions. Except simulations using the Lin, Ferrier, and Goddard parameterizations that tended to underestimate precipitation over almost all locations, the remaining six parameterizations tended to overestimate, particularly over China, southeast Asia, western United States, and central Africa. This method could be used by other researchers to help understand how the increasingly massive body of scientific literature can be harnessed through the power of artificial intelligence to solve their research problems.
Out-of-distribution (OOD) detection in deep learning has traditionally been framed as a binary task, where samples are either classified as belonging to the known classes or marked as OOD, with little attention given to the semantic relationships between OOD samples and the in-distribution (ID) classes. We propose a framework for detecting and classifying OOD samples in a given class hierarchy. Specifically, we aim to predict OOD data to their correct internal nodes of the class hierarchy, whereas the known ID classes should be predicted as their corresponding leaf nodes. Our approach leverages the class hierarchy to create a probabilistic model and we implement this model by using networks trained for ID classification at multiple hierarchy depths. We conduct experiments on three datasets with predefined class hierarchies and show the effectiveness of our method. Our code is available at https://github.com/walline/prohoc.
We recently proposed a method for estimation of states and parameters in stochastic differential equations, which included intermediate time points between observations and used the Laplace approximation to integrate out these intermediate states. In this paper, we establish a Laplace approximation for the transition probabilities in the continuous-time limit where the computational time step between intermediate states vanishes. Our technique views the driving Brownian motion as a control, casts the problem as one of minimum effort control between two states, and employs a Girsanov shift of probability measure as well as a weak noise approximation to obtain the Laplace approximation. We demonstrate the technique with examples; one where the approximation is exact due to a property of coordinate transforms, and one where contributions from non-near paths impair the approximation. We assess the order of discrete-time scheme, and demonstrate the Strang splitting leads to higher order and accuracy than Euler-type discretization. Finally, we investigate numerically how the accuracy of the approximation depends on the noise intensity and the length of the time interval.
We consider the problem of constructing a least conservative estimator of the expected value $\mu$ of a non-negative heavy-tailed random variable. We require that the probability of overestimating the expected value $\mu$ is kept appropriately small; a natural requirement if its subsequent use in a decision process is anticipated. In this setting, we show it is optimal to estimate $\mu$ by solving a distributionally robust optimization (DRO) problem using the Kullback-Leibler (KL) divergence. We further show that the statistical properties of KL-DRO compare favorably with other estimators based on truncation, variance regularization, or Wasserstein DRO.
A novel and intuitive nearest neighbours based clustering algorithm is introduced, in which a cluster is defined in terms of an equilibrium condition which balances its size and cohesiveness. The formulation of the equilibrium condition allows for a quantification of the strength of alignment of each point to a cluster, with these cluster alignment strengths leading naturally to a model selection criterion which renders the proposed approach fully automatable. The algorithm is simple to implement and computationally efficient, and produces clustering solutions of extremely high quality in comparison with relevant benchmarks from the literature. R code to implement the approach is available from https://github.com/DavidHofmeyr/NNEC.
The challenge of noisy multi-objective optimization lies in the constant trade-off between exploring new decision points and improving the precision of known points through resampling. This decision should take into account both the variability of the objective functions and the current estimate of a point in relation to the Pareto front. Since the amount and distribution of noise are generally unknown, it is desirable for a decision function to be highly adaptive to the properties of the optimization problem. This paper presents a resampling decision function that incorporates the stochastic nature of the optimization problem by using bootstrapping and the probability of dominance. The distribution-free estimation of the probability of dominance is achieved using bootstrap estimates of the means. To make the procedure applicable even with very few observations, we transfer the distribution observed at other decision points. The efficiency of this resampling approach is demonstrated by applying it in the NSGA-II algorithm with a sequential resampling procedure under multiple noise variations.
We consider the problem of consistent low-rank approximation for multigroup data: we ask for a sequence of $k$ basis vectors such that projecting the data onto their spanned subspace treats all groups as equally as possible, by minimizing the maximum error among the groups. Additionally, we require that the sequence of basis vectors satisfies the natural consistency property: when looking for the best $k$ vectors, the first $d<k$ vectors are the best possible solution to the problem of finding $d$ basis vectors. Thus, this multigroup low-rank approximation method naturally generalizes \svd and reduces to \svd for data with a single group. We give an iterative algorithm for this task that sequentially adds to the basis the vector that gives the best rank$-1$ projection according to the min-max criterion, and then projects the data onto the orthogonal complement of that vector. For finding the best rank$-1$ projection, we use primal-dual approaches or semidefinite programming. We analyze the theoretical properties of the algorithms and demonstrate empirically that the proposed methods compare favorably to existing methods for multigroup (or fair) PCA.
The Glivenko-Cantelli theorem is a uniform version of the strong law of large numbers. It states that for every IID sequence of random variables, the empirical measure converges to the underlying distribution (in the sense of uniform convergence of the CDF). In this work, we provide tools to study such limits of empirical measures in categorical probability. We propose two axioms, permutation invariance and empirical adequacy, that a morphism of type $X^\mathbb{N} \to X$ should satisfy to be interpretable as taking an infinite sequence as input and producing a sample from its empirical measure as output. Since not all sequences have a well-defined empirical measure, ``such empirical sampling morphisms'' live in quasi-Markov categories, which, unlike Markov categories, allow partial morphisms. Given an empirical sampling morphism and a few other properties, we prove representability as well as abstract versions of the de Finetti theorem, the Glivenko-Cantelli theorem and the strong law of large numbers. We provide several concrete constructions of empirical sampling morphisms as partially defined Markov kernels on standard Borel spaces. Instantiating our abstract results then recovers the standard Glivenko-Cantelli theorem and the strong law of large numbers for random variables with finite first moment. Our work thus provides a joint proof of these two theorems in conjunction with the de Finetti theorem from first principles.
In causal inference with observational studies, synthetic control (SC) has emerged as a prominent tool. SC has traditionally been applied to aggregate-level datasets, but more recent work has extended its use to individual-level data. As they contain a greater number of observed units, this shift introduces the curse of dimensionality to SC. To address this, we propose Cluster Synthetic Control (ClusterSC), based on the idea that groups of individuals may exist where behavior aligns internally but diverges between groups. ClusterSC incorporates a clustering step to select only the relevant donors for the target. We provide theoretical guarantees on the improvements induced by ClusterSC, supported by empirical demonstrations on synthetic and real-world datasets. The results indicate that ClusterSC consistently outperforms classical SC approaches.
Discrete-event simulation (DES) is widely used in healthcare Operations Research, but the models themselves are rarely shared. This limits their potential for reuse and long-term impact in the modelling and healthcare communities. This study explores the feasibility of using generative artificial intelligence (AI) to recreate published models using Free and Open Source Software (FOSS), based on the descriptions provided in an academic journal. Using a structured methodology, we successfully generated, tested and internally reproduced two DES models, including user interfaces. The reported results were replicated for one model, but not the other, likely due to missing information on distributions. These models are substantially more complex than AI-generated DES models published to date. Given the challenges we faced in prompt engineering, code generation, and model testing, we conclude that our iterative approach to model development, systematic comparison and testing, and the expertise of our team were necessary to the success of our recreated simulation models.
The RNA structure-function relationship has recently garnered significant attention within the deep learning community, promising to grow in importance as nucleic acid structure models advance. However, the absence of standardized and accessible benchmarks for deep learning on RNA 3D structures has impeded the development of models for RNA functional characteristics. In this work, we introduce a set of seven benchmarking datasets for RNA structure-function prediction, designed to address this gap. Our library builds on the established Python library rnaglib, and offers easy data distribution and encoding, splitters and evaluation methods, providing a convenient all-in-one framework for comparing models. Datasets are implemented in a fully modular and reproducible manner, facilitating for community contributions and customization. Finally, we provide initial baseline results for all tasks using a graph neural network. Source code: https://github.com/cgoliver/rnaglib Documentation: https://rnaglib.org
Fine-tuning Large Language Models (LLMs) on some task-specific datasets has been a primary use of LLMs. However, it has been empirically observed that this approach to enhancing capability inevitably compromises safety, a phenomenon also known as the safety-capability trade-off in LLM fine-tuning. This paper presents a theoretical framework for understanding the interplay between safety and capability in two primary safety-aware LLM fine-tuning strategies, providing new insights into the effects of data similarity, context overlap, and alignment loss landscape. Our theoretical results characterize the fundamental limits of the safety-capability trade-off in LLM fine-tuning, which are also validated by numerical experiments.
We propose a novel two-stage framework of generative models named Debiasing Kernel-Based Generative Models (DKGM) with the insights from kernel density estimation (KDE) and stochastic approximation. In the first stage of DKGM, we employ KDE to bypass the obstacles in estimating the density of data without losing too much image quality. One characteristic of KDE is oversmoothing, which makes the generated image blurry. Therefore, in the second stage, we formulate the process of reducing the blurriness of images as a statistical debiasing problem and develop a novel iterative algorithm to improve image quality, which is inspired by the stochastic approximation. Extensive experiments illustrate that the image quality of DKGM on CIFAR10 is comparable to state-of-the-art models such as diffusion models and GAN models. The performance of DKGM on CelebA 128x128 and LSUN (Church) 128x128 is also competitive. We conduct extra experiments to exploit how the bandwidth in KDE affects the sample diversity and debiasing effect of DKGM. The connections between DKGM and score-based models are also discussed.
This document is an extended version of an abstract for a talk, with approximately the same title, to be held at the 7th Joint Statistical Meeting of the Deutsche Arbeitsgemeinschaft Statistik, from 24 to 28 March 2025 in Berlin. Here ``teachable'' is meant to apply to people ranging from sufficiently advanced high school pupils to university students in mathematics or statistics: For understanding most of the proposed approximation results, it should suffice to know binomial laws, their means and variances, and the standard normal distribution function (but not necessarily the concept of a corresponding normal random variable). Of the proposed approximations, some are well-known (at least to experts), and some are based on teaching experience and research at Trier University.