Applications

2025-03-28 | Total: 13

#1 Confidence Adjusted Surprise Measure for Active Resourceful Trials (CA-SMART): A Data-driven Active Learning Framework for Accelerating Material Discovery under Resource Constraints

Authors: Ahmed Shoyeb Raihan, Zhichao Liu, Tanveer Hossain Bhuiyan, Imtiaz Ahmed

Accelerating the discovery and manufacturing of advanced materials with specific properties is a critical yet formidable challenge due to the vast search space, the high cost of experiments, and the time-intensive nature of material characterization. In recent years, active learning, where a surrogate machine learning (ML) model mimics the scientific discovery process of a human scientist, has emerged as a promising approach to address these challenges by guiding experimentation toward high-value outcomes with a limited budget. Among the diverse active learning philosophies, the concept of surprise (capturing the divergence between expected and observed outcomes) has demonstrated significant potential to drive experimental trials and refine predictive models. Scientific discovery often stems from surprise, making it a natural driver of the search process. Despite its promise, prior studies leveraging surprise metrics such as Shannon and Bayesian surprise lack mechanisms to account for prior confidence, leading to excessive exploration of uncertain regions that may not yield useful information. To address this, we propose the Confidence-Adjusted Surprise Measure for Active Resourceful Trials (CA-SMART), a novel Bayesian active learning framework tailored for optimizing data-driven experimentation. At a high level, CA-SMART incorporates Confidence-Adjusted Surprise (CAS) to dynamically balance exploration and exploitation by amplifying surprises in regions where the model is more certain while discounting them in highly uncertain areas. We evaluated CA-SMART on two benchmark functions (Six-Hump Camelback and Griewank) and in predicting the fatigue strength of steel. The results demonstrate superior accuracy and efficiency compared to traditional surprise metrics, standard Bayesian Optimization (BO) acquisition functions, and conventional ML methods.
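The abstract does not give the CAS formula, so the sketch below illustrates only the baseline it mentions, Shannon surprise, under the assumption of a Gaussian predictive distribution (as a Gaussian-process surrogate would supply). The function name and the numbers are hypothetical. It shows why a confidence adjustment might be wanted: with the same prediction error, a confident model is far more surprised than an uncertain one, yet with no error at all the uncertain region still scores higher.

```python
import math

def shannon_surprise(y_obs, mu, sigma):
    """Negative log-density of y_obs under a Gaussian predictive N(mu, sigma^2).

    Illustrative only: this is plain Shannon surprise, not the paper's CAS.
    """
    var = sigma ** 2
    return 0.5 * math.log(2 * math.pi * var) + (y_obs - mu) ** 2 / (2 * var)

# Same prediction error (1.0) in a confident and in an uncertain region.
confident_miss = shannon_surprise(y_obs=1.0, mu=0.0, sigma=0.2)
uncertain_miss = shannon_surprise(y_obs=1.0, mu=0.0, sigma=2.0)

# Perfect predictions: the uncertain region still yields the larger surprise,
# purely because of its wide predictive variance.
confident_hit = shannon_surprise(y_obs=0.0, mu=0.0, sigma=0.2)
uncertain_hit = shannon_surprise(y_obs=0.0, mu=0.0, sigma=2.0)

print(confident_miss, uncertain_miss, confident_hit, uncertain_hit)
```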

Subjects: Machine Learning, Artificial Intelligence, Applications

Publish: 2025-03-27 02:21:42 UTC


#2 Using large language models to produce literature reviews: Usages and systematic biases of microphysics parametrizations in 2699 publications

Authors: Tianhang Zhang, Shengnan Fu, David M. Schultz, Zhonghua Zheng

Large language models afford opportunities for using computers for intensive tasks, realizing research opportunities that have not been considered before. One such opportunity could be a systematic interrogation of the scientific literature. Here, we show how a large language model can be used to construct a literature review of 2699 publications associated with microphysics parameterizations in the Weather Research and Forecasting (WRF) model, with the goal of learning how the parameterizations were used and what systematic biases they showed when simulating precipitation. The database was constructed from publications identified through Web of Science and Scopus searches. The large language model GPT-4 Turbo was used to extract information about model configurations and performance from the text of the 2699 publications. Our results reveal the landscape of how nine of the most popular microphysics parameterizations have been used around the world: Lin, Ferrier, WRF Single-Moment, Goddard Cumulus Ensemble, Morrison, Thompson, and WRF Double-Moment. More studies used one-moment parameterizations before 2020 and two-moment parameterizations after 2020. Seven out of nine parameterizations tended to overestimate precipitation. However, the systematic biases of the parameterizations differed between regions. Except for simulations using the Lin, Ferrier, and Goddard parameterizations, which tended to underestimate precipitation over almost all locations, the remaining six parameterizations tended to overestimate, particularly over China, southeast Asia, the western United States, and central Africa. This method could be used by other researchers to help understand how the increasingly massive body of scientific literature can be harnessed through the power of artificial intelligence to solve their research problems.

Subjects: Artificial Intelligence, Applications

Publish: 2025-03-27 10:42:19 UTC


#3 Unlocking the Potential of Past Research: Using Generative AI to Reconstruct Healthcare Simulation Models

Authors: Thomas Monks, Alison Harper, Amy Heather

Discrete-event simulation (DES) is widely used in healthcare Operations Research, but the models themselves are rarely shared. This limits their potential for reuse and long-term impact in the modelling and healthcare communities. This study explores the feasibility of using generative artificial intelligence (AI) to recreate published models using Free and Open Source Software (FOSS), based on the descriptions provided in an academic journal. Using a structured methodology, we successfully generated, tested and internally reproduced two DES models, including user interfaces. The reported results were replicated for one model, but not the other, likely due to missing information on distributions. These models are substantially more complex than AI-generated DES models published to date. Given the challenges we faced in prompt engineering, code generation, and model testing, we conclude that our iterative approach to model development, systematic comparison and testing, and the expertise of our team were necessary to the success of our recreated simulation models.

Subjects: Artificial Intelligence, Applications

Publish: 2025-03-27 16:10:02 UTC


#4 Probabilistic Downscaling for Flood Hazard Models

Authors: Samantha Roth, Sanjib Sharma, Atieh Alipour, Klaus Keller, Murali Haran

Riverine flooding poses significant risks. Developing strategies to manage flood risks requires flood projections with decision-relevant scales and well-characterized uncertainties, often at high spatial resolutions. However, calibrating high-resolution flood models can be computationally prohibitive. To address this challenge, we propose a probabilistic downscaling approach that maps low-resolution model projections onto higher-resolution grids. The existing literature presents two distinct types of downscaling approaches: (1) probabilistic methods, which are versatile and applicable across various physics-based models, and (2) deterministic downscaling methods, specifically tailored for flood hazard models. Each type comes with its own set of mutually exclusive advantages. Here we introduce a new approach, PDFlood, that combines the advantages of existing probabilistic and flood model-specific downscaling approaches, mainly (1) spatial flooding probabilities and (2) improved accuracy from approximating physical processes. Compared to the state-of-the-art deterministic downscaling approach for flood hazard models, PDFlood allows users to consider previously neglected uncertainties while providing comparable accuracy, thereby better informing the design of risk management strategies. While we develop PDFlood for flood models, the general concepts translate to other applications such as wildfire models.

Subjects: Methodology, Applications

Publish: 2025-03-26 19:55:58 UTC


#5 Least Squares as Random Walks

Authors: Alexander Kostinski, Glenn Ierley, Sarah Kostinski

Linear least squares (LLS) is perhaps the most common method of data analysis, dating back to Legendre, Gauss and Laplace. Framed as linear regression, LLS is also a backbone of mathematical statistics. Here we report on an unexpected new connection between LLS and random walks. To that end, we introduce the notion of a random walk based on a discrete sequence of data samples (data walk). We show that the slope of a straight line which annuls the net area under a residual data walk equals the one found by LLS. For equidistant data samples this result is exact and holds for an arbitrary distribution of steps.
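The stated equivalence can be checked numerically. The sketch below is one reading of the abstract, not the authors' construction: it assumes the residual data walk is the running sum of residuals, and that the intercept is chosen so the walk returns to zero at the last sample. Under those assumptions, the net area under the walk is linear in the slope, so the annulling slope has a closed form. The function names and the synthetic data are hypothetical.

```python
import random
from itertools import accumulate

def lls_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sum((xi - x_bar) ** 2 for xi in x)
    return num / den

def walk_slope(x, y):
    """Slope that makes the net area under the residual walk vanish.

    Residuals are r_i = (y_i - y_bar) - b * (x_i - x_bar); the walk is their
    running sum. The area is area_y - b * area_x, so it vanishes at
    b = area_y / area_x.
    """
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    area_y = sum(accumulate(yi - y_bar for yi in y))
    area_x = sum(accumulate(xi - x_bar for xi in x))
    return area_y / area_x

random.seed(0)
x = [0.5 * i for i in range(50)]  # equidistant samples
# Skewed, non-Gaussian noise: the result should not depend on the step law.
y = [1.0 + 2.0 * xi + random.expovariate(1.0) for xi in x]

print(lls_slope(x, y), walk_slope(x, y))
```

With equidistant samples the two slopes agree to rounding error, because a zero net area combined with a walk that returns to zero reproduces the two normal equations of least squares.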

Subjects: Methodology, Statistical Mechanics, Probability, Data Analysis, Statistics and Probability, Applications

Publish: 2025-03-26 20:08:06 UTC


#6 Deep Learning for Forensic Identification of Source

Authors: Cole Patten, Christopher Saunders, Michael Puthawala

We used contrastive neural networks to learn useful similarity scores between the 144 cartridge casings in the NBIDE dataset, under the common-but-unknown source paradigm. The common-but-unknown source problem is a problem archetype in forensics where the question is whether two objects share a common source (e.g. were two cartridge casings fired from the same firearm). Similarity scores are often used to interpret evidence under this paradigm. We directly compared our results to a state-of-the-art algorithm, Congruent Matching Cells (CMC). When trained on the E3 dataset of 2967 cartridge casings, contrastive learning achieved an ROC AUC of 0.892. The CMC algorithm achieved 0.867. We also conducted an ablation study where we varied the neural network architecture; specifically, the network's width or depth. The ablation study showed that contrastive network performance results are somewhat robust to the network architecture. This work was in part motivated by the use of similarity scores attained via contrastive learning for standard evidence interpretation methods such as score-based likelihood ratios.
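For readers unfamiliar with the reported metric, the sketch below shows how an ROC AUC can be computed from pairwise similarity scores. It uses the rank interpretation of AUC: the probability that a same-source pair scores higher than a different-source pair, with ties counted as one half. The scores and labels are made-up toy values, not data from NBIDE or E3, and the function name is hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (same-source, different-source) pairs ranked correctly.

    labels: 1 for a same-source pair, 0 for a different-source pair.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy similarity scores for six casing pairs (hypothetical values).
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]

auc = roc_auc(scores, labels)
print(auc)
```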

Subjects: Machine Learning, Applications

Publish: 2025-03-26 21:13:08 UTC


#7 Compositional Outcomes and Environmental Mixtures: the Dirichlet Bayesian Weighted Quantile Sum Regression

Authors: Hachem Saddiki, Joshua L. Warren, Corina Lesseur, Elena Colicino

Environmental mixture approaches do not accommodate compositional outcomes, which consist of vectors constrained to the unit simplex. This limitation poses challenges in effectively evaluating the associations between multiple concurrent environmental exposures and their respective impacts on this type of outcome. As a result, there is a pressing need for the development of analytical methods that can more accurately assess the complexity of these relationships. Here, we extend the Bayesian weighted quantile sum regression (BWQS) framework for jointly modeling compositional outcomes and environmental mixtures using a Dirichlet distribution with a multinomial logit link function. The proposed approach, named Dirichlet-BWQS (DBWQS), allows for the simultaneous estimation of mixture weights associated with each exposure mixture component as well as the association between the overall exposure mixture index and each outcome proportion. We assess the performance of DBWQS regression on extensive simulated data and a real scenario where we investigate the associations between environmental chemical mixtures and DNA methylation-derived placental cell composition, using publicly available data (GSE75248). We also compare our findings with results considering environmental mixtures and each outcome component. Finally, we developed an R package, xbwqs, that makes our proposed method publicly available (https://github.com/hasdk/xbwqs).

Subjects: Methodology, Applications

Publish: 2025-03-27 12:13:41 UTC


#8 Sparse Bayesian Learning for Label Efficiency in Cardiac Real-Time MRI

Authors: Felix Terhag, Philipp Knechtges, Achim Basermann, Anja Bach, Darius Gerlach, Jens Tank, Raúl Tempone

Cardiac real-time magnetic resonance imaging (MRI) is an emerging technology that images the heart at up to 50 frames per second, offering insight into the respiratory effects on the heartbeat. However, this method significantly increases the number of images that must be segmented to derive critical health indicators. Although neural networks perform well on inner slices, predictions on outer slices are often unreliable. To address this challenge, this work proposes sparse Bayesian learning (SBL) to predict the ventricular volume on outer slices with minimal manual labeling. The ventricular volume over time is assumed to be dominated by sparse frequencies corresponding to the heart and respiratory rates. SBL identifies these sparse frequencies on well-segmented inner slices by optimizing hyperparameters via type-II likelihood, automatically pruning irrelevant components. The identified sparse frequencies guide the selection of outer slice images for labeling, minimizing posterior variance. This work provides performance guarantees for the greedy algorithm. Testing on patient data demonstrates that only a few labeled images are necessary for accurate volume prediction. The labeling procedure effectively avoids selecting inefficient images. Furthermore, the Bayesian approach provides uncertainty estimates, highlighting unreliable predictions (e.g., when choosing suboptimal labels).

Subjects: Methodology, Computer Vision and Pattern Recognition, Probability, Statistics Theory, Applications

Publish: 2025-03-27 12:36:20 UTC


#9 Inequality Restricted Minimum Density Power Divergence Estimation in Panel Count Data

Authors: Udita Goswami, Shuvashree Mondal

Analysis of panel count data has garnered a considerable amount of attention in the literature, leading to the development of multiple statistical techniques. In inferential analysis, most works focus on estimating-equation-based techniques or conventional maximum likelihood estimation. However, the robustness of these methods is largely questionable. In this paper, we present the robust density power divergence estimation for panel count data arising from nonhomogeneous Poisson processes, correlated through a latent frailty variable. To cope with real-world incidents, it is often desirable to impose certain inequality constraints on the parameter space, giving rise to the restricted minimum density power divergence estimator. The significant contribution of this study lies in deriving its asymptotic properties. The proposed method ensures high efficiency in the model estimation while providing reliable inference despite data contamination. Moreover, the density power divergence measure is governed by a tuning parameter \(\gamma\), which controls the trade-off between robustness and efficiency. To effectively determine the optimal value of \(\gamma\), this study employs a generalized score-matching technique, marking considerable progress in the data analysis. Simulation studies and real data examples are provided to illustrate the performance of the estimator and to substantiate the theory developed.
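To make the robustness claim concrete, the sketch below applies the standard minimum density power divergence objective to a much simpler setting than the paper's: i.i.d. Poisson counts with one gross outlier, no frailty, no inequality constraints, and a fixed \(\gamma\) rather than one chosen by score matching. The objective minimized is \(\sum_x f_\lambda(x)^{1+\gamma} - (1 + 1/\gamma)\, n^{-1} \sum_i f_\lambda(X_i)^{\gamma}\). The data, the grid, and the function names are hypothetical.

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability mass, computed on the log scale for stability."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))

def dpd_objective(lam, data, gamma, x_max=200):
    """Empirical density power divergence objective for a Poisson(lam) model."""
    # Sum of f^(1+gamma) over the support, truncated at x_max (an assumption).
    model_term = sum(poisson_pmf(x, lam) ** (1 + gamma) for x in range(x_max + 1))
    # Average of f(X_i)^gamma over the observed counts.
    data_term = sum(poisson_pmf(x, lam) ** gamma for x in data) / len(data)
    return model_term - (1 + 1 / gamma) * data_term

def min_dpd_estimate(data, gamma, grid):
    """Grid-search minimizer; a real implementation would use an optimizer."""
    return min(grid, key=lambda lam: dpd_objective(lam, data, gamma))

# Ten ordinary counts plus one contaminating observation (50).
counts = [2, 3, 1, 2, 4, 3, 2, 1, 3, 2, 50]
grid = [0.5 + 0.01 * k for k in range(951)]  # candidate rates 0.5 .. 10.0

mle = sum(counts) / len(counts)               # sample mean, pulled up by the outlier
robust = min_dpd_estimate(counts, gamma=0.5, grid=grid)
print(mle, robust)
```

The outlier contributes almost nothing to the data term because its model probability is negligible, so the estimate stays near the bulk of the data. As \(\gamma\) approaches zero the objective approaches the negative log-likelihood, and that protection fades in exchange for efficiency.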

Subjects: Methodology, Applications

Publish: 2025-03-27 14:26:20 UTC


#10 De Novo Functional Protein Sequence Generation: Overcoming Data Scarcity through Regeneration and Large Models

Authors: Chenyu Ren, Daihai He, Jian Huang

Proteins are essential components of all living organisms and play a critical role in cellular survival. They have a broad range of applications, from clinical treatments to material engineering. This versatility has spurred the development of protein design, with amino acid sequence design being a crucial step in the process. Recent advancements in deep generative models have shown promise for protein sequence design. However, the scarcity of functional protein sequence data for certain types can hinder the training of these models, which often require large datasets. To address this challenge, we propose a hierarchical model named ProteinRG that can generate functional protein sequences using relatively small datasets. ProteinRG begins by generating a representation of a protein sequence, leveraging existing large protein sequence models, before producing a functional protein sequence. We have tested our model on various functional protein sequences and evaluated the results from three perspectives: multiple sequence alignment, t-SNE distribution analysis, and 3D structure prediction. The findings indicate that our generated protein sequences maintain both similarity to the original sequences and consistency with the desired functions. Moreover, our model demonstrates superior performance compared to other generative models for protein sequence generation.

Subject: Applications

Publish: 2025-03-27 03:25:21 UTC


#11 Explainable Boosting Machine for Predicting Claim Severity and Frequency in Car Insurance

Authors: Markéta Krùpovà, Nabil Rachdi, Quentin Guibert

In a context of steadily increasing competition and heightened regulatory pressure, accuracy, actuarial precision, and the transparency and understanding of the tariff are key issues in non-life insurance. Traditionally used generalized linear models (GLM) result in a multiplicative tariff that favors interpretability. With the rapid development of machine learning and deep learning techniques, actuaries and the rest of the insurance industry have adopted these techniques widely. However, there is a need to associate them with interpretability techniques. This paper introduces an Explainable Boosting Machine (EBM) model that combines intrinsically interpretable characteristics and high prediction performance. This approach is described as a glass-box model and relies on the use of a Generalized Additive Model (GAM) and a cyclic gradient boosting algorithm. It accounts for univariate and pairwise interaction effects between features and naturally provides explanations of them. We implement this approach on car insurance frequency and severity data and extensively compare its performance with classical competitors: a GLM, a GAM, a CART model and an Extreme Gradient Boosting (XGB) algorithm. Finally, we examine the interpretability of these models to capture the main determinants of claim costs.

Subjects: Applications, Machine Learning

Publish: 2025-03-27 09:59:45 UTC


#12 Calibration of medium-range metocean forecasts for the North Sea

Authors: Conor Murphy, Ross Towe, Philip Jonathan

We assess the value of calibrating forecast models for significant wave height Hs, wind speed W and mean spectral wave period Tm for forecast horizons between zero and 168 hours from a commercial forecast provider, to improve forecast performance for a location in the central North Sea. We consider two straightforward calibration models, linear regression (LR) and non-homogeneous Gaussian regression (NHGR), incorporating deterministic, control and ensemble mean forecast covariates. We show that relatively simple calibration models (with at most three covariates) provide good calibration and that addition of further covariates cannot be justified. Optimal calibration models (for the forecast mean of a physical quantity) always make use of the deterministic forecast and ensemble mean forecast for the same quantity, together with a covariate associated with a different physical quantity. The selection of optimal covariates is performed independently per forecast horizon, and the set of optimal covariates shows a large degree of consistency across forecast horizons. As a result, it is possible to specify a consistent model to calibrate a given physical quantity, incorporating a common set of three covariates for all horizons. For NHGR models of a given physical quantity, the ensemble forecast standard deviation for that quantity is skilful in predicting forecast error standard deviation, strikingly so for Hs. We show that the consistent LR and NHGR calibration models facilitate reduction in forecast bias to near zero for all of Hs, W and Tm, and that there is little difference between LR and NHGR calibration for the mean. Both LR and NHGR models facilitate reduction in forecast error standard deviation relative to naive adoption of the (uncalibrated) deterministic forecast, with NHGR providing somewhat better performance.

Subject: Applications

Publish: 2025-03-27 13:24:16 UTC


#13 Investigating Experiential Effects in Online Chess using a Hierarchical Bayesian Analysis

Authors: Adam Gee, Sydney O. Seese, James P. Curley, Owen G. Ward

The presence or absence of winner-loser effects is a widely discussed phenomenon across both sports and psychology research. Investigation of such effects is often hampered by the limited availability of data. Online chess has exploded in popularity in recent years and provides vast amounts of data which can be used to explore this question. With a hierarchical Bayesian regression model, we carefully investigate the presence of such experiential effects in online chess. Using a large quantity of online chess data, we see little evidence for experiential effects that are consistent across all players, with some individual players showing some evidence for such effects. Given the challenging temporal nature of this data, we discuss several methods for assessing the suitability of our model and carefully check its validity.

Subject: Applications

Publish: 2025-03-27 17:24:48 UTC