Quantitative Methods

2025-06-19 | | Total: 6

#1 DisProtEdit: Exploring Disentangled Representations for Multi-Attribute Protein Editing [PDF] [Copy] [Kimi1] [REL]

Authors: Max Ku, Sun Sun, Hongyu Guo, Wenhu Chen

We introduce DisProtEdit, a controllable protein editing framework that leverages dual-channel natural language supervision to learn disentangled representations of structural and functional properties. Unlike prior approaches that rely on joint holistic embeddings, DisProtEdit explicitly separates semantic factors, enabling modular and interpretable control. To support this, we construct SwissProtDis, a large-scale multimodal dataset where each protein sequence is paired with two textual descriptions, one for structure and one for function, automatically decomposed using a large language model. DisProtEdit aligns protein and text embeddings using alignment and uniformity objectives, while a disentanglement loss promotes independence between structural and functional semantics. At inference time, protein editing is performed by modifying one or both text inputs and decoding from the updated latent representation. Experiments on protein editing and representation learning benchmarks demonstrate that DisProtEdit performs competitively with existing methods while providing improved interpretability and controllability. On a newly constructed multi-attribute editing benchmark, the model achieves a both-hit success rate of up to 61.7%, highlighting its effectiveness in coordinating simultaneous structural and functional edits.

Subjects: Quantitative Methods , Machine Learning

Publish: 2025-06-17 06:12:18 UTC


#2 Automatic computation of the glycemic index: data driven analysis of the glucose standard [PDF] [Copy] [Kimi] [REL]

Authors: Fabio Credali, Maria Teresa Venuti, Daniele Boffi, Paola Rossi

The Glycemic Index (GI) is a tool for classifying carbohydrates based on their impact on postprandial glycemia, useful for diabetes prevention and management. This study applies a mathematical model for a data driven simulation of the glycemic response following glucose ingestion. The analysis reveals a direct correlation between glucose response profiles and parameters describing glucose absorption, enabling the classification of subjects into three groups based on the timing of their glycemic peak. Our results offer potential applications for both glycemic index simulation and advancing biological studies on diabetes.

Subjects: Dynamical Systems , Quantitative Methods

Publish: 2025-06-18 14:03:54 UTC


#3 Universal Laboratory Model: prognosis of abnormal clinical outcomes based on routine tests [PDF] [Copy] [Kimi] [REL]

Authors: Pavel Karpov, Ilya Petrenkov, Ruslan Raiman

Clinical laboratory results are ubiquitous in any diagnosis making. Predicting abnormal values of not prescribed tests based on the results of performed tests looks intriguing, as it would be possible to make early diagnosis available to everyone. The special place is taken by the Common Blood Count (CBC) test, as it is the most widely used clinical procedure. Combining routine biochemical panels with CBC presents a set of test-value pairs that varies from patient to patient, or, in common settings, a table with missing values. Here we formulate a tabular modeling problem as a set translation problem where the source set comprises pairs of GPT-like label column embedding and its corresponding value while the target set consists of the same type embeddings only. The proposed approach can effectively deal with missing values without implicitly estimating them and bridges the world of LLM with the tabular domain. Applying this method to clinical laboratory data, we achieve an improvement up to 8% AUC for joint predictions of high uric acid, glucose, cholesterol, and low ferritin levels.

Subjects: Machine Learning , Quantitative Methods

Publish: 2025-06-18 10:10:02 UTC


#4 BMFM-RNA: An Open Framework for Building and Evaluating Transcriptomic Foundation Models [PDF] [Copy] [Kimi] [REL]

Authors: Bharath Dandala, Michael M. Danziger, Ella Barkan, Tanwi Biswas, Viatcheslav Gurev, Jianying Hu, Matthew Madgwick, Akira Koseki, Tal Kozlovski, Michal Rosen-Zvi, Yishai Shimoni, Ching-Huei Tsou

Transcriptomic foundation models (TFMs) have recently emerged as powerful tools for analyzing gene expression in cells and tissues, supporting key tasks such as cell-type annotation, batch correction, and perturbation prediction. However, the diversity of model implementations and training strategies across recent TFMs, though promising, makes it challenging to isolate the contribution of individual design choices or evaluate their potential synergies. This hinders the field's ability to converge on best practices and limits the reproducibility of insights across studies. We present BMFM-RNA, an open-source, modular software package that unifies diverse TFM pretraining and fine-tuning objectives within a single framework. Leveraging this capability, we introduce a novel training objective, whole cell expression decoder (WCED), which captures global expression patterns using an autoencoder-like CLS bottleneck representation. In this paper, we describe the framework, supported input representations, and training objectives. We evaluated four model checkpoints pretrained on CELLxGENE using combinations of masked language modeling (MLM), WCED and multitask learning. Using the benchmarking capabilities of BMFM-RNA, we show that WCED-based models achieve performance that matches or exceeds state-of-the-art approaches like scGPT across more than a dozen datasets in both zero-shot and fine-tuning tasks. BMFM-RNA, available as part of the biomed-multi-omics project ( https://github.com/BiomedSciAI/biomed-multi-omic ), offers a reproducible foundation for systematic benchmarking and community-driven exploration of optimal TFM training strategies, enabling the development of more effective tools to leverage the latest advances in AI for understanding cell biology.

Subjects: Genomics , Artificial Intelligence , Quantitative Methods

Publish: 2025-06-17 15:40:08 UTC


#5 AZT1D: A Real-World Dataset for Type 1 Diabetes [PDF] [Copy] [Kimi] [REL]

Authors: Saman Khamesian, Asiful Arefeen, Bithika M. Thompson, Maria Adela Grando, Hassan Ghasemzadeh

High quality real world datasets are essential for advancing data driven approaches in type 1 diabetes (T1D) management, including personalized therapy design, digital twin systems, and glucose prediction models. However, progress in this area has been limited by the scarcity of publicly available datasets that offer detailed and comprehensive patient data. To address this gap, we present AZT1D, a dataset containing data collected from 25 individuals with T1D on automated insulin delivery (AID) systems. AZT1D includes continuous glucose monitoring (CGM) data, insulin pump and insulin administration data, carbohydrate intake, and device mode (regular, sleep, and exercise) obtained over 6 to 8 weeks for each patient. Notably, the dataset provides granular details on bolus insulin delivery (i.e., total dose, bolus type, correction specific amounts) features that are rarely found in existing datasets. By offering rich, naturalistic data, AZT1D supports a wide range of artificial intelligence and machine learning applications aimed at improving clinical decision making and individualized care in T1D.

Subjects: Machine Learning , Quantitative Methods

Publish: 2025-05-27 21:54:53 UTC


#6 Integrating Dynamical Systems Learning with Foundational Models: A Meta-Evolutionary AI Framework for Clinical Trials [PDF] [Copy] [Kimi] [REL]

Authors: Joseph Geraci, Bessi Qorri, Christian Cumbaa, Mike Tsay, Paul Leonczyk, Luca Pani

Artificial intelligence (AI) has evolved into an ecosystem of specialized "species," each with unique strengths. We analyze two: DeepSeek-V3, a 671-billion-parameter Mixture of Experts large language model (LLM) exemplifying scale-driven generality, and NetraAI, a dynamical system-based framework engineered for stability and interpretability on small clinical trial datasets. We formalize NetraAI's foundations, combining contraction mappings, information geometry, and evolutionary algorithms to identify predictive patient cohorts. Features are embedded in a metric space and iteratively contracted toward stable attractors that define latent subgroups. A pseudo-temporal embedding and long-range memory enable exploration of higher-order feature interactions, while an internal evolutionary loop selects compact, explainable 2-4-variable bundles ("Personas"). To guide discovery, we introduce an LLM Strategist as a meta-evolutionary layer that observes Persona outputs, prioritizes promising variables, injects domain knowledge, and assesses robustness. This two-tier architecture mirrors the human scientific process: NetraAI as experimentalist, the LLM as theorist, forming a self-improving loop. In case studies (schizophrenia, depression, pancreatic cancer), NetraAI uncovered small, high-effect-size subpopulations that transformed weak baseline models (AUC ~0.50-0.68) into near-perfect classifiers using only a few features. We position NetraAI at the intersection of dynamical systems, information geometry, and evolutionary learning, aligned with emerging concept-level reasoning paradigms such as LeCun's Joint Embedding Predictive Architecture (JEPA). By prioritizing reliable, explainable knowledge, NetraAI offers a new generation of adaptive, self-reflective AI to accelerate clinical discovery.

Subjects: Machine Learning , Quantitative Methods

Publish: 2025-05-25 03:34:33 UTC