Quantitative Methods

2024-10-04 | Total: 8

#1 G2T-LLM: Graph-to-Tree Text Encoding for Molecule Generation with Fine-Tuned Large Language Models

Authors: Zhaoning Yu ; Xiangyang Xu ; Hongyang Gao

We introduce G2T-LLM, a novel approach for molecule generation that uses graph-to-tree text encoding to transform graph-based molecular structures into a hierarchical text format optimized for large language models (LLMs). This encoding converts complex molecular graphs into tree-structured formats, such as JSON and XML, which LLMs are particularly adept at processing due to their extensive pre-training on these types of data. By leveraging the flexibility of LLMs, our approach allows for intuitive interaction using natural language prompts, providing a more accessible interface for molecular design. Through supervised fine-tuning, G2T-LLM generates valid and coherent chemical structures, addressing common challenges like invalid outputs seen in traditional graph-based methods. While LLMs are computationally intensive, they offer superior generalization and adaptability, enabling the generation of diverse molecular structures with minimal task-specific customization. The proposed approach achieves performance comparable to state-of-the-art methods on various benchmark molecular generation datasets, demonstrating its potential as a flexible and innovative tool for AI-driven molecular design.
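As a rough illustration of the graph-to-tree idea in the abstract above, the following Python sketch unrolls a small molecular graph into a nested, JSON-serializable tree by depth-first traversal. It is not the authors' encoding: the function name `graph_to_tree`, the field names (`id`, `element`, `bond`, `children`, `ring_closures`), and the choice to record ring-closing bonds as references to already-visited atom ids are all assumptions made for this example. Hydrogens are left implicit.

```python
import json


def graph_to_tree(atoms, bonds, root=0):
    """Unroll an undirected molecular graph into a nested dict (hypothetical schema).

    atoms: list of element symbols, indexed by atom id
    bonds: list of (i, j, order) tuples
    """
    adjacency = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        adjacency[i].append((j, order))
        adjacency[j].append((i, order))

    visited = set()
    seen_bonds = set()

    def visit(node):
        visited.add(node)
        entry = {"id": node, "element": atoms[node], "children": [], "ring_closures": []}
        for neighbor, order in adjacency[node]:
            bond = frozenset((node, neighbor))
            if bond in seen_bonds:
                continue  # each bond is emitted exactly once
            seen_bonds.add(bond)
            if neighbor in visited:
                # A bond back to an already-placed atom closes a ring.
                entry["ring_closures"].append({"to": neighbor, "bond": order})
            else:
                child = visit(neighbor)
                child["bond"] = order  # bond order linking the child to its parent
                entry["children"].append(child)
        return entry

    return visit(root)


# Toy graph: heavy atoms of cyclopropanol (three-carbon ring, oxygen on atom 0).
atoms = ["C", "C", "C", "O"]
bonds = [(0, 1, 1), (1, 2, 1), (2, 0, 1), (0, 3, 1)]
tree = graph_to_tree(atoms, bonds)
encoded = json.dumps(tree, indent=2)
print(encoded)
```

Because every bond appears exactly once, either as a parent-child link or as a ring closure, the original graph can in principle be rebuilt from the tree. That round-trip property is the reason for tracking ring closures explicitly in this sketch.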

Subjects: Machine Learning ; Artificial Intelligence ; Quantitative Methods

Publish: 2024-10-03 04:25:21 UTC

#2 Deep Generative Modeling for Identification of Noisy, Non-Stationary Dynamical Systems

Authors: Doris Voina ; Steven Brunton ; J. Nathan Kutz

A significant challenge in many fields of science and engineering is making sense of time-dependent measurement data by recovering governing equations in the form of differential equations. We focus on finding parsimonious ordinary differential equation (ODE) models for nonlinear, noisy, and non-autonomous dynamical systems and propose a machine learning method for data-driven system identification. While many methods tackle noisy and limited data, non-stationarity, where differential equation parameters change over time, has received less attention. Our method, dynamic SINDy, combines variational inference with SINDy (sparse identification of nonlinear dynamics) to model time-varying coefficients of sparse ODEs. This framework allows for uncertainty quantification of ODE coefficients, expanding on previous methods for autonomous systems. These coefficients are then interpreted as latent variables and added to the system to obtain an autonomous dynamical model. We validate our approach using synthetic data, including nonlinear oscillators and the Lorenz system, and apply it to neuronal activity data from C. elegans. Dynamic SINDy uncovers a global nonlinear model, showing it can handle real, noisy, and chaotic datasets. We aim to apply our method to a variety of problems, specifically dynamical systems with complex time-dependent parameters.
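For readers unfamiliar with the SINDy building block named in the abstract above, here is a minimal NumPy sketch of the standard, stationary version: sequentially thresholded least squares over a library of candidate terms. It does not implement dynamic SINDy, variational inference, or time-varying coefficients. The function name `stlsq`, the polynomial library, the threshold value, and the toy system dx/dt = -2x are all assumptions chosen for illustration.

```python
import numpy as np


def stlsq(theta, dxdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares: sparse xi with theta @ xi ~ dxdt."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0  # prune terms with negligible coefficients
        keep = ~small
        if not keep.any():
            break
        # Refit using only the surviving library terms.
        xi[keep] = np.linalg.lstsq(theta[:, keep], dxdt, rcond=None)[0]
    return xi


# Toy data from dx/dt = -2x, using the exact derivative to avoid differencing error.
t = np.linspace(0.0, 2.0, 200)
x = np.exp(-2.0 * t)
dxdt = -2.0 * x

# Candidate library: constant, linear, and quadratic terms.
theta = np.column_stack([np.ones_like(x), x, x**2])
xi = stlsq(theta, dxdt)
print(dict(zip(["1", "x", "x^2"], xi)))
```

On this noise-free example only the linear term survives thresholding. With real measurements the derivative would have to be estimated from data, and the threshold becomes a tuning parameter.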

Subjects: Machine Learning ; Quantitative Methods

Publish: 2024-10-02 23:00:00 UTC

#3 Enhancing End Stage Renal Disease Outcome Prediction: A Multi-Sourced Data-Driven Approach

Authors: Yubo Li ; Rema Padman

Objective: To improve prediction of Chronic Kidney Disease (CKD) progression to End Stage Renal Disease (ESRD) using machine learning (ML) and deep learning (DL) models applied to an integrated clinical and claims dataset of varying observation windows, supported by explainable AI (XAI) to enhance interpretability and reduce bias. Materials and Methods: We used data on 10,326 CKD patients, combining their clinical and claims information from 2009 to 2018. Following data preprocessing, cohort identification, and feature engineering, we evaluated multiple statistical, ML, and DL models using data extracted from five distinct observation windows. Feature importance and Shapley value analysis were employed to understand key predictors. Models were tested for robustness, clinical relevance, misclassification errors, and bias issues. Results: Integrated data models outperformed those using single data sources, with the Long Short-Term Memory (LSTM) model achieving the highest AUC (0.93) and F1 score (0.65). A 24-month observation window was identified as optimal for balancing early detection and prediction accuracy. The 2021 eGFR equation improved prediction accuracy and reduced racial bias, notably for African American patients. Discussion: The improved ESRD prediction accuracy, interpretability of results, and bias mitigation strategies presented in this study have the potential to significantly enhance CKD and ESRD management, support targeted early interventions, and reduce healthcare disparities. Conclusion: This study presents a robust framework for predicting ESRD outcomes in CKD patients, improving clinical decision-making and patient care through multi-sourced, integrated data and AI/ML methods. Future research will expand data integration and explore the application of this framework to other chronic diseases.

Subjects: Quantitative Methods ; Machine Learning

Publish: 2024-10-02 03:21:01 UTC

#4 Global dynamical structures from infinitesimal data

Authors: Benjamin McInroe ; Robert J. Full ; Daniel E. Koditschek ; Yuliy Baryshnikov

Discovering mechanisms underlying the behaviors of complex, high dimensional, and nonlinear dynamical systems is a central goal of the natural and synthetic sciences. Breakthroughs in machine learning in concert with increasing capacities for computation and data collection have enabled the use of trajectory measurements for learning predictive models. However, rigorous approaches for interpreting mechanisms from such models remain elusive, and asymptotic prediction accuracy suffers if the model does not capture important state space structures (e.g., attracting invariant sets). These limitations are especially pressing for system-level behaviors such as whole-body locomotion, where discontinuous, transient, and multiscale phenomena are common and prior models are rare. To take the next step towards a theory and practice for dynamical inference of complex multiscale systems in biology and beyond, we introduce VERT, a framework for learning the attracting sets that characterize global system behavior without recourse to learning a global model. Our approach is based on an infinitesimal-local-global (ILG) framework for estimating the proximity of any sampled state to the attracting set, if one exists, with formal accuracy guarantees. We demonstrate our approach on phenomenological and physical oscillators with hierarchical and impulsive dynamics, finding sensitivity to both global and intermediate attractors composed in sequence and parallel. Application of VERT to human running kinematics data reveals insight into control modules that stabilize task-level dynamics, supporting a longstanding neuromechanical control hypothesis. The VERT framework thus enables rigorous inference of underlying dynamical structure even for systems where learning a global dynamics model is impractical or impossible.

Subjects: Quantitative Methods ; Dynamical Systems ; Chaotic Dynamics

Publish: 2024-10-03 00:30:05 UTC

#5 DeepProtein: Deep Learning Library and Benchmark for Protein Sequence Learning

Authors: Jiaqing Xie ; Yue Zhao ; Tianfan Fu

In recent years, deep learning has revolutionized the field of protein science, enabling advancements in predicting protein properties, structural folding, and interactions. This paper presents DeepProtein, a comprehensive and user-friendly deep learning library specifically designed for protein-related tasks. DeepProtein integrates several state-of-the-art neural network architectures, including convolutional neural networks (CNN), recurrent neural networks (RNN), transformers, graph neural networks (GNN), and graph transformers (GT). It provides user-friendly interfaces that help domain researchers apply deep learning techniques to protein data. We also curate a benchmark that evaluates these neural architectures on a variety of protein tasks, including protein function prediction, protein localization prediction, and protein-protein interaction prediction, showcasing the library's superior performance and scalability. Additionally, we provide detailed documentation and tutorials to promote accessibility and encourage reproducible research. This library is extended from a well-known drug discovery library, DeepPurpose, and is publicly available at https://github.com/jiaqingxie/DeepProtein/tree/main.

Subjects: Machine Learning ; Artificial Intelligence ; Quantitative Methods

Publish: 2024-10-02 20:42:32 UTC

#6 FARM: Functional Group-Aware Representations for Small Molecules

Authors: Thao Nguyen ; Kuan-Hao Huang ; Ge Liu ; Martin D. Burke ; Ying Diao ; Heng Ji

We introduce Functional Group-Aware Representations for Small Molecules (FARM), a novel foundation model designed to bridge the gap between SMILES, natural language, and molecular graphs. The key innovation of FARM lies in its functional group-aware tokenization, which incorporates functional group information directly into the representations. This strategic reduction in tokenization granularity, intentionally aligned with the key drivers of functional properties (i.e., functional groups), enhances the model's understanding of chemical language, expands the chemical lexicon, more effectively bridges SMILES and natural language, and ultimately advances the model's capacity to predict molecular properties. FARM also represents molecules from two perspectives: by using masked language modeling to capture atom-level features and by employing graph neural networks to encode the whole molecule topology. By leveraging contrastive learning, FARM aligns these two views of representations into a unified molecular embedding. We rigorously evaluate FARM on the MoleculeNet dataset, where it achieves state-of-the-art performance on 10 out of 12 tasks. These results highlight FARM's potential to improve molecular representation learning, with promising applications in drug discovery and pharmaceutical research.
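The abstract above says contrastive learning aligns the token-level and graph-level views but does not name a loss. One common choice is a symmetric InfoNCE objective, sketched below in NumPy purely as an assumption about what such an alignment could look like. The function name `info_nce`, the temperature, and the random embeddings standing in for the two encoders' outputs are all hypothetical.

```python
import numpy as np


def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE loss; row i of z_a and row i of z_b form a positive pair."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # scaled cosine similarities

    def cross_entropy_diag(scores):
        # Cross-entropy where the correct "class" for row i is column i.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


rng = np.random.default_rng(0)
token_view = rng.normal(size=(8, 16))  # stand-in for sequence-encoder embeddings
graph_view_aligned = token_view + 0.05 * rng.normal(size=(8, 16))
graph_view_shuffled = np.roll(graph_view_aligned, 1, axis=0)  # pairs deliberately mismatched

loss_aligned = info_nce(token_view, graph_view_aligned)
loss_shuffled = info_nce(token_view, graph_view_shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is low when each molecule's two views are closer to each other than to any other molecule's view, and high when the pairing is broken. That is the sense in which such an objective pulls two views into one embedding space.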

Subjects: Machine Learning ; Quantitative Methods

Publish: 2024-10-02 23:04:58 UTC

#7 Recovering Time-Varying Networks From Single-Cell Data

Authors: Euxhen Hasanaj ; Barnabás Póczos ; Ziv Bar-Joseph

Gene regulation is a dynamic process that underlies all aspects of human development, disease response, and other key biological processes. The reconstruction of temporal gene regulatory networks has conventionally relied on regression analysis, graphical models, or other types of relevance networks. With the large increase in time series single-cell data, new approaches are needed to address the unique scale and nature of this data for reconstructing such networks. Here, we develop a deep neural network, Marlene, to infer dynamic graphs from time series single-cell gene expression data. Marlene constructs directed gene networks using a self-attention mechanism where the weights evolve over time using recurrent units. By employing meta learning, the model is able to recover accurate temporal networks even for rare cell types. In addition, Marlene can identify gene interactions relevant to specific biological responses, including COVID-19 immune response, fibrosis, and aging.
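To make the phrase "directed gene networks using a self-attention mechanism" in the abstract above more concrete, the NumPy sketch below computes scaled dot-product attention weights between genes at a single time point and reads them as a weighted directed adjacency matrix. It is an illustration under stated assumptions, not Marlene itself: the recurrent evolution of the weights and the meta-learning step are omitted, random matrices stand in for learned projections, and the name `attention_adjacency` and the convention that entry [i, j] is the weight gene i places on gene j are hypothetical.

```python
import numpy as np


def attention_adjacency(expr, w_query, w_key):
    """Row-normalized attention weights, read here as a weighted directed gene graph.

    expr: (genes, cells) array holding each gene's expression profile
    """
    q = expr @ w_query
    k = expr @ w_key
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n_genes, n_cells, d = 5, 20, 4
expr = rng.normal(size=(n_genes, n_cells))
# Random projections stand in for parameters a trained model would learn.
w_query = rng.normal(size=(n_cells, d)) / np.sqrt(n_cells)
w_key = rng.normal(size=(n_cells, d)) / np.sqrt(n_cells)

adj = attention_adjacency(expr, w_query, w_key)
print(adj.round(2))
```

Using separate query and key projections means adj[i, j] generally differs from adj[j, i], which is what allows a directed interpretation of the edges.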

Subjects: Quantitative Methods ; Machine Learning

Publish: 2024-10-01 19:18:51 UTC

#8 An Efficient Inference Frame for SMLM (Single-Molecule Localization Microscopy)

Author: Tingdan Luo

Single-molecule localization microscopy (SMLM) surpasses the diffraction limit, achieving subcellular resolution. Traditional SMLM analysis methods often rely on point spread function (PSF) model fitting, limiting the application of complex PSF models. In recent years, deep learning approaches have significantly improved SMLM algorithms, yielding promising results. However, limitations in inference speed and model size have restricted the widespread adoption of deep learning in practical applications. To address these challenges, this paper proposes an efficient model deployment framework and introduces a lightweight neural network, DilatedLoc, aimed at enhancing both image reconstruction quality and inference speed. Compared to leading network models, DilatedLoc reduces the network's parameter size to under 100 MB and achieves a 50% improvement in inference speed, with superior GPU utilization through a novel deployment architecture compatible with various network models.

Subjects: Quantitative Methods ; Computational Engineering, Finance, and Science ; Image and Video Processing

Publish: 2024-10-03 08:52:10 UTC