Data Analysis, Statistics and Probability

2026-04-17 | Total: 4

#1 Development of an LLM-Based System for Automatic Code Generation from HEP Publications

Authors: Masahiko Saito, Tomoe Kishimoto, Junichi Tanaka

Ensuring the reproducibility of physics results is one of the crucial challenges in high-energy physics (HEP). In this study, we develop a proof-of-concept system that uses large language models (LLMs) to extract analysis procedures from HEP publications and generate executable analysis code for reproducing published results. Our method consists of two stages. In the first stage, open-weight LLMs extract event selection criteria, object definitions, and other relevant analysis information from a target paper and, when necessary, from its referenced publications, and then produce a structured selection list. In the second stage, the structured selection list is used to generate analysis code, which is then executed and validated iteratively. As a benchmark, we use the ATLAS $H \to ZZ^{*} \to 4\ell$ analysis based on proton-proton collision data recorded in 2015 and 2016 and released as ATLAS Open Data. This benchmark allows direct comparison between the generated results and the published analysis, as well as comparison with a manually developed baseline implementation. We separately evaluate selection extraction and code generation in order to clarify the current capabilities and limitations of open-weight LLMs for HEP analysis reproduction. Our initial results show that recent open-weight models can recover many documented selection criteria from papers and references, and that in some runs they can generate event selections fully matching a baseline implementation at the event level. At the same time, stochasticity, hallucination, and execution failure remain significant challenges. These results suggest that LLMs are already promising as human-in-the-loop tools for reproducibility support, although they are not yet reliable as fully autonomous HEP analysis agents. In this paper, we report the design of the prototype system and its initial performance evaluation.

Subject: Data Analysis, Statistics and Probability

Publish: 2026-04-16 06:59:16 UTC
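
The event-level comparison mentioned in the abstract above (generated selection versus a manually developed baseline) can be illustrated with a minimal sketch. This is not the authors' code: the `(run, event)` identifier scheme and the `compare_selections` helper are assumptions introduced purely for illustration.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical event identifier: (run number, event number).
EventId = Tuple[int, int]


def compare_selections(
    generated: Iterable[EventId], baseline: Iterable[EventId]
) -> Dict[str, object]:
    """Compare two event selections by their event identifiers."""
    gen, base = set(generated), set(baseline)
    return {
        "matched": len(gen & base),             # events selected by both
        "only_generated": sorted(gen - base),   # extra events in generated code
        "only_baseline": sorted(base - gen),    # events the generated code missed
        "full_match": gen == base,              # True only for identical selections
    }


if __name__ == "__main__":
    # Toy identifiers, not real ATLAS Open Data events.
    report = compare_selections([(1, 10), (1, 11)], [(1, 11), (1, 12)])
    print(report)
```

A set-based comparison of this kind reports exactly which events differ, which is more informative for debugging a generated selection than comparing total yields alone.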


#2 NOMAI : A real-time photometric classifier for superluminous supernovae identification. A science module for the Fink broker

Authors: E. Russeil, R. Lunnan, J. Peloton, S. Schulze, P. J. Pessi, D. Perley, J. Sollerman, A. Gkini, Y. Hu, T.-W. Chen, E. C. Bellm, T. X. Chen, B. Rusholme

Superluminous supernovae (SLSNe) are among the most luminous stellar explosions known, yet they remain poorly understood. Because they are intrinsically rare, efficiently identifying them in the large alert streams produced by modern time-domain surveys is essential for enabling spectroscopic follow-up. We present NOMAI, a machine learning classifier designed to identify SLSN candidates directly from photometric alerts in the ZTF stream, using light curves accumulated over at least 30 days. It does not require any spectroscopic redshift and runs in real time within the Fink broker. ZTF light curves are transformed into a set of physically motivated features derived primarily from model-fitting procedures using SALT2 and Rainbow, a blackbody-based multi-band fitting framework. These features are used to train an XGBoost classifier on a curated dataset of labeled ZTF sources constructed using literature samples of SLSNe, along with TNS and internal ZTF labeled sources. The final training dataset contains 5280 unique sources, including 225 spectroscopically classified SLSNe. On the training sample, the classifier reaches 66% completeness and 58% purity. Deployed within the Fink broker, NOMAI has been running continuously on the ZTF alert stream since 18 December 2025 and publicly reports SLSN candidates every night by automatically posting them to dedicated communication channels. We also report on the first two months of operation as an evaluation period, during which the classifier successfully recovered 22 of the 24 active SLSNe reported on the Transient Name Server. The achieved performance demonstrates that the classifier provides a valuable tool for experts to efficiently scan the alert stream and identify promising candidates. In the near future, NOMAI is intended to be adapted to operate on the Legacy Survey of Space and Time conducted by the Vera C. Rubin Observatory.

Subjects: Instrumentation and Methods for Astrophysics, High Energy Astrophysical Phenomena, Data Analysis, Statistics and Probability

Publish: 2026-04-16 08:22:04 UTC
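
For readers unfamiliar with the metrics quoted in the NOMAI abstract above, completeness corresponds to recall and purity to precision. The sketch below uses the standard definitions; the function names are illustrative assumptions and are not taken from the NOMAI code base. Only the 22-of-24 recovery figure comes from the abstract, which does not give the counts behind the 66% and 58% values.

```python
def completeness(true_positives: int, false_negatives: int) -> float:
    """Fraction of real SLSNe that the classifier recovers (recall)."""
    return true_positives / (true_positives + false_negatives)


def purity(true_positives: int, false_positives: int) -> float:
    """Fraction of reported candidates that are real SLSNe (precision)."""
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Evaluation period quoted in the abstract: 22 of 24 active SLSNe recovered.
    recovered, active = 22, 24
    print(f"completeness = {completeness(recovered, active - recovered):.1%}")
```

With the figures from the evaluation period, this gives a completeness of about 92%. The purity over that period cannot be computed from the abstract, since the number of false candidates is not stated.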


#3 FAIR Universe Weak Lensing ML Uncertainty Challenge: Handling Uncertainties and Distribution Shifts for Precision Cosmology

Authors: Biwei Dai, Po-Wen Chang, Wahid Bhimji, Paolo Calafiura, Ragansu Chakkappai, Yuan-Tang Chou, Sascha Diefenbacher, Jordan Dudley, Ibrahim Elsharkawy, Steven Farrell, Isabelle Guyon, Chris Harris, Elham E Khoda, Benjamin Nachman, David Rousseau, Uroš Seljak, Ihsan Ullah, Yulei Zhang

Weak gravitational lensing, the correlated distortion of background galaxy shapes by foreground structures, is a powerful probe of the matter distribution in our universe and allows accurate constraints on the cosmological model. In recent years, high-order statistics and machine learning (ML) techniques have been applied to weak lensing data to extract the nonlinear information beyond traditional two-point analysis. However, these methods typically rely on cosmological simulations, which poses several challenges: simulations are computationally expensive, limiting most realistic setups to a low training data regime; inaccurate modeling of systematics in the simulations creates distribution shifts that can bias cosmological parameter constraints; and varying simulation setups across studies make method comparison difficult. To address these difficulties, we present the first weak lensing benchmark dataset with several realistic systematics and launch the FAIR Universe Weak Lensing Machine Learning Uncertainty Challenge. The challenge focuses on measuring the fundamental properties of the universe from weak lensing data with a limited training set and potential distribution shifts, while providing a standardized benchmark for rigorous comparison across methods. Organized in two phases, the challenge will bring together the physics and ML communities to advance the methodologies for handling systematic uncertainties, data efficiency, and distribution shifts in weak lensing analysis with ML, ultimately facilitating the deployment of ML approaches in upcoming weak lensing survey analyses.

Subjects: Cosmology and Nongalactic Astrophysics, Artificial Intelligence, Computer Vision and Pattern Recognition, Data Analysis, Statistics and Probability

Publish: 2026-04-15 22:09:23 UTC


#4 Seabird trajectories map onto a reduced optimal-control bound for dynamic soaring

Authors: Louis González, Saad Bhamla

Dynamic soaring allows seabirds to harvest mechanical energy from vertical wind shear, but field trajectories lack a benchmark for comparing flight performance across species. We derive a reduced lower bound on transport effort from a simplified Hamilton-Jacobi-Bellman optimal-control model in which slow flight incurs an induced-drag penalty, fast flight incurs a dissipative penalty, and wind shear supplies an effective energetic subsidy. After species-specific normalization of transport speed and an accelerometer-based effort proxy, we map wandering albatrosses, Cory's shearwaters, and Eurasian oystercatchers into a common reduced speed-effort plane and estimate their empirical lower frontiers. The albatross frontier lies closest to the reduced bound, consistent with near-optimal wind-energy harvesting. The shearwater frontier is systematically displaced above it, and oystercatchers occupy a distinct non-soaring regime. The resulting framework places specialist dynamic soaring, mixed flap-gliding, and non-soaring flight in a common mechanical representation and provides a reduced benchmark for comparing wind-assisted flight across species using field trajectories.

Subjects: Biological Physics, Data Analysis, Statistics and Probability

Publish: 2026-04-15 18:06:55 UTC