Quantitative Methods

Date: Fri, 21 Jun 2024 | Total: 13

#1 MolecularGPT: Open Large Language Model (LLM) for Few-Shot Molecular Property Prediction

Authors: Yuyan Liu ; Sirui Ding ; Sheng Zhou ; Wenqi Fan ; Qiaoyu Tan

Molecular property prediction (MPP) is a fundamental and crucial task in drug discovery. However, prior methods are limited by the requirement for a large number of labeled molecules and their restricted ability to generalize to unseen and new tasks, both of which are essential for real-world applications. To address these challenges, we present MolecularGPT for few-shot MPP. From the perspective of instruction tuning, we fine-tune large language models (LLMs) on curated molecular instructions spanning over 1000 property prediction tasks. This enables building a versatile and specialized LLM that can be adapted to novel MPP tasks without any fine-tuning through zero- and few-shot in-context learning (ICL). MolecularGPT exhibits competitive in-context reasoning capabilities across 10 downstream evaluation datasets, setting new benchmarks for few-shot molecular prediction tasks. More importantly, with just two-shot examples, MolecularGPT can outperform standard supervised graph neural network methods on 4 out of 7 datasets. It also outperforms state-of-the-art LLM baselines by up to a 16.6% increase in classification accuracy and a decrease of 199.17 in regression metrics (e.g., RMSE) under zero-shot settings. This study demonstrates the potential of LLMs as effective few-shot molecular property predictors. The code is available at https://github.com/NYUSHCS/MolecularGPT.

Subjects: Quantitative Methods ; Artificial Intelligence ; Computational Engineering, Finance, and Science ; Computation and Language ; Machine Learning

Publish: 2024-06-18 12:54:47 UTC

#2 HIGHT: Hierarchical Graph Tokenization for Graph-Language Alignment

Authors: Yongqiang Chen ; Quanming Yao ; Juzheng Zhang ; James Cheng ; Yatao Bian

Recently there has been a surge of interest in extending the success of large language models (LLMs) to the graph modality, such as social networks and molecules. As LLMs are predominantly trained with 1D text data, most existing approaches adopt a graph neural network to represent a graph as a series of node tokens and feed these tokens to LLMs for graph-language alignment. Despite achieving some successes, existing approaches have overlooked the hierarchical structures that are inherent in graph data. In molecular graphs especially, the high-order structural information contains rich semantics of molecular functional groups, which encode crucial biochemical functionalities of the molecules. We establish a simple benchmark showing that neglecting the hierarchical information in graph tokenization will lead to subpar graph-language alignment and severe hallucination in generated outputs. To address this problem, we propose a novel strategy called HIerarchical GrapH Tokenization (HIGHT). HIGHT employs a hierarchical graph tokenizer that extracts and encodes the hierarchy of node, motif, and graph levels of informative tokens to improve the graph perception of LLMs. HIGHT also adopts an augmented graph-language supervised fine-tuning dataset, enriched with the hierarchical graph information, to further enhance the graph-language alignment. Extensive experiments on 7 molecule-centric benchmarks confirm the effectiveness of HIGHT in reducing hallucination by 40%, as well as significant improvements in various molecule-language downstream tasks.

Subjects: Computation and Language ; Machine Learning ; Quantitative Methods

Publish: 2024-06-20 06:37:35 UTC

#3 AntibodyFlow: Normalizing Flow Model for Designing Antibody Complementarity-Determining Regions

Authors: Bohao Xu ; Yanbo Wang ; Wenyu Chen ; Shimin Shan

Therapeutic antibodies have been extensively studied in drug discovery and development in the past decades. Antibodies are specialized protective proteins that bind to antigens in a lock-and-key manner. The binding strength/affinity between an antibody and a specific antigen is heavily determined by the complementarity-determining regions (CDRs) on the antibodies. Existing machine learning methods cast in silico development of CDRs as either sequence or 3D graph (with a single chain) generation tasks and have achieved initial success. However, with CDR loops having specific geometric shapes, learning the 3D geometric structures of CDRs remains a challenge. To address this issue, we propose AntibodyFlow, a 3D flow model to design antibody CDR loops. Specifically, AntibodyFlow first constructs the distance matrix, then predicts amino acids conditioned on the distance matrix. Also, AntibodyFlow conducts constraint learning and constrained generation to ensure valid 3D structures. Experimental results indicate that AntibodyFlow outperforms the best baseline consistently with up to 16.0% relative improvement in validity rate and 24.3% relative reduction in geometric graph level error (root mean square deviation, RMSD).

Subjects: Machine Learning ; Artificial Intelligence ; Quantitative Methods

Publish: 2024-06-19 02:31:23 UTC

#4 The association of domain-specific physical activity and sedentary activity with stroke: A prospective cohort study

Authors: Xinyi He ; Shidi Wang ; Yi Li ; Jiucun Wang ; Guangrui Yang ; Jun Chen ; Zixin Hu

Background: The incidence of stroke places a heavy burden on both society and individuals. Physical activity (PA) is closely related to cardiovascular health. This study aimed to investigate the relationship of the varying domains of PA, namely occupation-related Physical Activity (OPA), transportation-related Physical Activity (TPA), and leisure-time Physical Activity (LTPA), and of Sedentary Activity (SA) with stroke. Methods: Our analysis included 30,400 participants aged 20+ years from the 2007 to 2018 National Health and Nutrition Examination Survey (NHANES). Stroke was identified based on the participant's self-reported diagnoses from previous medical consultations, and PA and SA were self-reported. Multivariable logistic and restricted cubic spline models were used to assess the associations. Results: Participants achieving PA guidelines (performing PA more than 150 min/week) were 35.7% less likely to have a stroke based on both the total PA (odds ratio [OR] 0.643, 95% confidence interval [CI] 0.523-0.790) and LTPA (OR 0.643, 95% CI 0.514-0.805), while OPA or TPA did not demonstrate lower stroke risk. Furthermore, participants with less than 7.5 h/day SA levels were 21.6% (OR 0.784, 95% CI 0.665-0.925) less likely to have a stroke. The intensities of total PA and LTPA exhibited nonlinear U-shaped associations with stroke risk. In contrast, those of OPA and TPA showed negative linear associations, while SA intensities were positively linearly correlated with stroke risk. Conclusions: LTPA, but not OPA or TPA, was associated with a lower risk of stroke at any amount, suggesting that cardiovascular health would benefit significantly from increased PA. Additionally, the positive association between SA and stroke indicated that prolonged sitting was detrimental to cardiovascular health. Overall, increased PA within a reasonable range reduces the risk of stroke, while increased SA elevates it.
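
The "less likely" percentages in the Results appear to follow from the reported odds ratios as (1 - OR) x 100, which is a reduction in odds rather than in probability. A minimal sketch of that arithmetic, assuming this reading; the helper name `percent_lower_odds` is hypothetical and not taken from the paper:

```python
def percent_lower_odds(odds_ratio: float) -> float:
    """Convert an odds ratio below 1 into a percent reduction in odds.

    Assumption: the abstract's percentages are (1 - OR) * 100, rounded
    to one decimal place.
    """
    return round((1.0 - odds_ratio) * 100.0, 1)


# Odds ratios quoted in the abstract.
total_pa = percent_lower_odds(0.643)  # meeting PA guidelines (total PA and LTPA)
low_sa = percent_lower_odds(0.784)    # less than 7.5 h/day of sedentary activity

print(total_pa, low_sa)
```

Because stroke is relatively uncommon in the sample, the odds ratio is close to the risk ratio, which is presumably why the abstract phrases the result as "less likely".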

Subjects: Medical Physics ; Quantitative Methods

Publish: 2024-06-19 07:25:00 UTC

#5 Kinetic Monte Carlo methods for three-dimensional diffusive capture problems in exterior domains

Authors: Alan E. Lindsay ; Andrew J. Bernoff

Cellular-scale decision making is modulated by the dynamics of signalling molecules and their diffusive trajectories from a source to small absorbing sites on the cellular surface. Diffusive capture problems are computationally challenging due to the complex geometry and the applied boundary conditions, together with intrinsically long transients that occur before a particle is captured. This paper reports on a particle-based Kinetic Monte Carlo (KMC) method that provides rapid, accurate simulation of arrival statistics for (i) a half-space bounded by a surface with a finite collection of absorbing traps and (ii) the domain exterior to a convex cell, again with absorbing traps. We validate our method by replicating classical results and, in addition, newly developed boundary homogenization theories and matched asymptotic expansions for capture rates. In the case of non-spherical domains, we describe a new shielding effect in which geometry can play a role in sharpening cellular estimates of the directionality of diffusive sources.

Subjects: Numerical Analysis ; Analysis of PDEs ; Biological Physics ; Quantitative Methods

Publish: 2024-06-19 15:48:05 UTC

#6 Segmentation of Non-Small Cell Lung Carcinomas: Introducing DRU-Net and Multi-Lens Distortion

Authors: Soroush Oskouei ; Marit Valla ; André Pedersen ; Erik Smistad ; Vibeke Grotnes Dale ; Maren Høibø ; Sissel Gyrid Freim Wahl ; Mats Dehli Haugum ; Thomas Langø ; Maria Paula Ramnefjell ; Lars Andreas Akslen ; Gabriel Kiss ; Hanne Sorger

Considering the increased workload in pathology laboratories today, automated tools such as artificial intelligence models can help pathologists with their tasks and ease the workload. In this paper, we propose a segmentation model (DRU-Net) that can provide a delineation of human non-small cell lung carcinomas and an augmentation method that can improve classification results. The proposed model is a fused combination of truncated pre-trained DenseNet201 and ResNet101V2 as a patch-wise classifier followed by a lightweight U-Net as a refinement model. We have used two datasets (Norwegian Lung Cancer Biobank and Haukeland University Hospital lung cancer cohort) to create our proposed model. The DRU-Net model achieves an average Dice similarity coefficient of 0.91. The proposed spatial augmentation method (multi-lens distortion) improved the network performance by 3%. Our findings show that choosing image patches that specifically include regions of interest leads to better results for the patch-wise classifier compared to other sampling methods. The qualitative analysis showed that the DRU-Net model is generally successful in detecting the tumor. On the test set, some of the cases showed areas of false positive and false negative segmentation in the periphery, particularly in tumors with inflammatory and reactive changes.
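
For readers unfamiliar with the metric quoted above, the Dice similarity coefficient of two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it on flattened toy masks; the function name and the example masks are illustrative assumptions and are not taken from the DRU-Net code.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Count pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Sum of foreground pixel counts in each mask.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention (an assumption here): two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks: 3 foreground pixels each, 2 of them shared.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(pred, target)
print(score)
```

A value of 1 means the predicted and reference masks coincide exactly, and 0 means they share no foreground pixels.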

Subjects: Image and Video Processing ; Computer Vision and Pattern Recognition ; Machine Learning ; Quantitative Methods

Publish: 2024-06-20 13:14:00 UTC

#7 Integrating time-resolved $nrf2$ gene-expression data into a full GUTS model as a proxy for toxicodynamic damage in zebrafish embryo

Authors: Florian Schunck ; Bernhard Kodritsch ; Wibke Busch ; Martin Krauss ; Andreas Focks

The immense production of the chemical industry requires an improved predictive risk assessment that can handle constantly evolving challenges while reducing the dependency of risk assessment on animal testing. Integrating 'omics data into mechanistic models offers a promising solution by linking cellular processes triggered after chemical exposure with observed effects in the organism. With the emerging availability of time-resolved RNA data, the goal of integrating gene expression data into mechanistic models can be approached. We propose a biologically anchored TKTD model, which describes key processes that link the gene expression level of the stress regulator $nrf2$ to detoxification and lethality by associating toxicodynamic damage with $nrf2$ expression. Fitting such a model to complex datasets consisting of multiple endpoints required the combination of methods from molecular biology, mechanistic dynamic systems modeling and Bayesian inference. In this study we successfully integrate time-resolved gene expression data into TKTD models, and thus provide a method for assessing the influence of molecular markers on survival. This novel method was used to test whether $nrf2$ can be applied to predict lethality in zebrafish embryos. With the presented approach we outline a method to successively approach the goal of a predictive risk assessment based on molecular data.

Subjects: Quantitative Methods ; Dynamical Systems ; Applications

Publish: 2024-06-18 12:28:38 UTC

#8 An interpretable generative multimodal neuroimaging-genomics framework for decoding Alzheimer's disease

Authors: Giorgio Dolci ; Federica Cruciani ; Md Abdur Rahaman ; Anees Abrol ; Jiayu Chen ; Zening Fu ; Ilaria Boscolo Galazzo ; Gloria Menegaz ; Vince D. Calhoun

Alzheimer's disease (AD) is the most prevalent form of dementia with a progressive decline in cognitive abilities. The AD continuum encompasses a prodromal stage known as Mild Cognitive Impairment (MCI), where patients may either progress to AD or remain stable. In this study, we leveraged structural and functional MRI to investigate the disease-induced grey matter and functional network connectivity changes. Moreover, considering AD's strong genetic component, we introduce SNPs as a third channel. Given such diverse inputs, missing one or more modalities is a typical concern of multimodal methods. We hence propose a novel deep learning-based classification framework in which a generative module employing Cycle GANs is used to impute missing data within the latent space. Additionally, we adopted an Explainable AI method, Integrated Gradients, to extract input feature relevance, enhancing our understanding of the learned representations. Two critical tasks were addressed: AD detection and MCI conversion prediction. Experimental results showed that our model was able to reach the state of the art in the classification of CN/AD, reaching an average test accuracy of $0.926\pm0.02$. For the MCI task, we achieved an average prediction accuracy of $0.711\pm0.01$ using the pre-trained model for CN/AD. The interpretability analysis revealed significant grey matter modulations in cortical and subcortical brain areas well known for their association with AD. Moreover, impairments in sensory-motor and visual resting state network connectivity along the disease continuum, as well as mutations in SNPs defining biological processes linked to amyloid-beta and cholesterol formation, clearance, and regulation, were identified as contributors to the achieved performance. Overall, our integrative deep learning approach shows promise for AD detection and MCI prediction, while shedding light on important biological insights.

Subjects: Quantitative Methods ; Artificial Intelligence ; Image and Video Processing

Publish: 2024-06-19 07:31:47 UTC

#9 Efficient gPC-based quantification of probabilistic robustness for systems in neuroscience

Authors: Uros Sutulovic ; Daniele Proverbio ; Rami Katz ; Giulia Giordano

We introduce and analyze generalised polynomial chaos (gPC), considering both intrusive and non-intrusive approaches, as an uncertainty quantification method in studies of probabilistic robustness. The considered gPC methods are complementary to Monte Carlo (MC) methods and are shown to be fast and scalable, allowing for comprehensive and efficient exploration of parameter spaces. These properties enable robustness analysis of a wider set of models, compared to computationally expensive MC methods, while retaining desired levels of accuracy. We discuss the application of gPC methods to systems in biology and neuroscience, notably subject to multiple parametric uncertainties, and we examine a well-known model of neural dynamics as a case study.

Subject: Quantitative Methods

Publish: 2024-06-19 12:19:03 UTC

#10 Network-community analysis of cellular senescence

Authors: Alda Sabalic ; Victoria Moiseeva ; Andres Cisneros ; Oleg Deryagin ; Eusebio Perdiguero ; Pura Muñoz-Canoves ; Jordi Garcia-Ojalvo

Most cellular phenotypes are genetically complex. Identifying the set of genes that are most closely associated with a specific cellular state is still an open question in many cases. Here we study the transcriptional profile of cellular senescence using a combination of network-based approaches, which include eigenvector centrality feature selection and community detection. We apply our method to cell-type-resolved RNA sequencing data obtained from injured muscle tissue in mice. The analysis identifies some genetic markers consistent with previous findings, and other previously unidentified ones, which are validated with previously published single-cell RNA sequencing data in a different type of tissue. The key identified genes, both those previously known and the newly identified ones, are transcriptional targets of factors known to be associated with established hallmarks of senescence, and can thus be interpreted as molecular correlates of such hallmarks. The method proposed here could be applied to any complex cellular phenotype even when only bulk RNA sequencing is available, provided the data is resolved by cell type.

Subject: Quantitative Methods

Publish: 2024-06-19 23:41:10 UTC

#11 An agent-based model of behaviour change calibrated to reversal learning data

Authors: Roben Delos Reyes ; Hugo Lyons Keenan ; Cameron Zachreson

Behaviour change lies at the heart of many observable collective phenomena such as the transmission and control of infectious diseases, adoption of public health policies, and migration of animals to new habitats. Representing the process of individual behaviour change in computer simulations of these phenomena remains an open challenge. Often, computational models use phenomenological implementations with limited support from behavioural data. Without a strong connection to observable quantities, such models have limited utility for simulating observed and counterfactual scenarios of emergent phenomena because they cannot be validated or calibrated. Here, we present a simple stochastic individual-based model of reversal learning that captures fundamental properties of individual behaviour change, namely, the capacity to learn based on accumulated reward signals, and the transient persistence of learned behaviour after rewards are removed or altered. The model has only two parameters, and we use approximate Bayesian computation to demonstrate that they are fully identifiable from empirical reversal learning time series data. Finally, we demonstrate how the model can be extended to account for the increased complexity of behavioural dynamics over longer time scales involving fluctuating stimuli. This work is a step towards the development and evaluation of fully identifiable individual-level behaviour change models that can function as validated submodels for complex simulations of collective behaviour change.

Subjects: Quantitative Methods ; Biological Physics ; Computation

Publish: 2024-06-20 07:38:08 UTC

#12 Geometric Self-Supervised Pretraining on 3D Protein Structures using Subgraphs

Authors: Michail Chatzianastasis ; George Dasoulas ; Michalis Vazirgiannis

Protein representation learning aims to learn informative protein embeddings capable of addressing crucial biological questions, such as protein function prediction. Although sequence-based transformer models have shown promising results by leveraging the vast amount of protein sequence data in a self-supervised way, there is still a gap in applying these methods to 3D protein structures. In this work, we propose a pre-training scheme that goes beyond trivial masking methods by leveraging the 3D and hierarchical structures of proteins. We propose a novel self-supervised method to pretrain 3D graph neural networks on 3D protein structures, by predicting the distances between local geometric centroids of protein subgraphs and the global geometric centroid of the protein. The motivation for this method is twofold. First, the relative spatial arrangements and geometric relationships among different regions of a protein are crucial for its function. Second, proteins are often organized in a hierarchical manner, where smaller substructures, such as secondary structure elements, assemble into larger domains. By considering subgraphs and their relationships to the global protein structure, the model can learn to reason about these hierarchical levels of organization. We experimentally show that our proposed pretraining strategy leads to significant improvements in the performance of 3D GNNs in various protein classification tasks.
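
The pretraining target described above, the distance from each subgraph's local centroid to the protein's global centroid, can be sketched as follows. This is only an illustration of the geometric quantity: the abstract does not say how subgraphs are sampled or which loss is used, so the function names, the toy coordinates, and the hand-picked subgraphs are all assumptions.

```python
import math


def centroid(points):
    """Arithmetic mean of a non-empty list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def centroid_distance_targets(coords, subgraphs):
    """Distance from each subgraph centroid to the global centroid.

    coords:    list of (x, y, z) node positions, e.g. residue coordinates.
    subgraphs: list of node-index lists (hypothetical sampling, chosen by hand here).
    """
    global_c = centroid(coords)
    return [
        math.dist(centroid([coords[i] for i in idx]), global_c)
        for idx in subgraphs
    ]


# Toy "protein": four nodes on the corners of a 2 x 2 square in the z = 0 plane.
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
subgraphs = [[0, 1], [2, 3], [0, 1, 2, 3]]
targets = centroid_distance_targets(coords, subgraphs)
print(targets)
```

In a pretraining setup, a 3D GNN would presumably be trained to regress values like `targets` from subgraph embeddings; that training loop is outside the scope of this sketch.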

Subjects: Quantitative Methods ; Machine Learning ; Biomolecules

Publish: 2024-06-20 09:34:31 UTC

#13 Non-Negative Universal Differential Equations With Applications in Systems Biology

Authors: Maren Philipps ; Antonia Körner ; Jakob Vanhoefer ; Dilan Pathirana ; Jan Hasenauer

Universal differential equations (UDEs) leverage the respective advantages of mechanistic models and artificial neural networks and combine them into one dynamic model. However, these hybrid models can suffer from unrealistic solutions, such as negative values for biochemical quantities. We present non-negative UDEs (nUDEs), a constrained UDE variant that guarantees non-negative values. Furthermore, we explore regularisation techniques to improve generalisation and interpretability of UDEs.

Subjects: Quantitative Methods ; Machine Learning ; Dynamical Systems

Publish: 2024-06-20 12:14:09 UTC