Quantitative Biology

2024-12-03 | Total: 24

#1 The Thermodynamic Model to Study the Slow Afterhyperpolarization in a Single Neuron at Different ATP Levels

Authors: Jianwei Li, Simeng Yu, Mingye Guo, Xuewen Shen, Qi Ouyang, Fangting Li

The neuron consumes energy from ATP hydrolysis to maintain a far-from-equilibrium steady state inside the cell, so all physiological functions inside the cell are modulated by thermodynamics. Neurons, which encode, transfer, and process information at high energy cost, display a phenomenon called slow afterhyperpolarization (sAHP) after burst firing, whose properties are affected by the energy conditions. Here we constructed a thermodynamic model to quantitatively describe the sAHP process generated by $Na^+$-$K^+$ ATPases (NKA) and calcium-activated potassium (K(Ca)) channels. The model simulates how the amplitude of the sAHP is affected by the intracellular ATP concentration and the ATP hydrolysis free energy $\Delta G$. The results show a trade-off between the modulation by NKA and by K(Ca) of the sAHP's energy dependence, and also predict an alteration of sAHP behavior under insufficient ATP supply if the ratio of NKA to K(Ca) expression levels is changed. The research provides insights into the maintenance of neural homeostasis and supports further research on metabolism-related and neurodegenerative diseases.

Subject: Neurons and Cognition

Publish: 2024-12-02 16:54:47 UTC
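
Illustrative note for entry #1 (not taken from the paper): the ATP hydrolysis free energy $\Delta G$ mentioned in the abstract is commonly written as $\Delta G = \Delta G^{0'} + RT \ln([ADP][P_i]/[ATP])$. The sketch below evaluates that textbook expression. The standard value of about -30.5 kJ/mol, the temperature of 310 K, and the example concentrations are assumptions chosen for illustration, not parameters from the authors' model.

```python
import math

R = 8.314  # gas constant in J/(mol*K)

def atp_hydrolysis_free_energy(atp, adp, pi, temp_k=310.0, dg0_kj=-30.5):
    """Return Delta G (kJ/mol) for ATP -> ADP + Pi.

    Concentrations are in mol/L. dg0_kj is an assumed textbook value
    for the standard free energy, not a number from the paper.
    """
    if min(atp, adp, pi) <= 0:
        raise ValueError("concentrations must be positive")
    q = (adp * pi) / atp  # mass-action ratio relative to a 1 M standard state
    return dg0_kj + R * temp_k * math.log(q) / 1000.0

# Hypothetical "well-supplied" and "ATP-depleted" conditions.
well_supplied = atp_hydrolysis_free_energy(atp=5e-3, adp=0.5e-3, pi=5e-3)
depleted = atp_hydrolysis_free_energy(atp=1e-3, adp=2e-3, pi=5e-3)
print(f"well supplied: {well_supplied:.1f} kJ/mol, depleted: {depleted:.1f} kJ/mol")
```

Under these assumed numbers, lowering ATP and raising ADP makes $\Delta G$ less negative, which is the sense in which less energy is available per hydrolysis event.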


#2 Task learning through stimulation-induced plasticity in neural networks

Authors: Francesco Borra, Simona Cocco, Rémi Monasson

Synaptic plasticity dynamically shapes the connectivity of neural systems and is key to learning processes in the brain. To what extent the mechanisms of plasticity can be exploited to drive a neural network and make it perform some kind of computational task remains unclear. This question, relevant in a bioengineering context, can be formulated as a control problem on a high-dimensional system with strongly constrained and non-linear dynamics. We present a self-contained procedure which, through appropriate spatio-temporal stimulations of the neurons, is able to drive rate-based neural networks with arbitrary initial connectivity towards a desired functional state. We illustrate our approach on two different computational tasks: a non-linear association between multiple input stimulations and activity patterns (representing digit images), and the construction of a continuous attractor encoding a collective variable in a neural population. Our work thus provides a proof of principle for emerging paradigms of in vitro computation based on real neurons.

Subjects: Neurons and Cognition, Disordered Systems and Neural Networks

Publish: 2024-12-02 16:29:51 UTC


#3 Microbial Mat Metagenomes from Waikite Valley, Aotearoa New Zealand

Authors: Beatrice Tauer, Elizabeth Trembath-Reichert, L. M. Ward

The rise of complex multicellular ecosystems in Neoproterozoic time was preceded by a microbial Proterozoic biosphere, where productivity may have been largely restricted to microbial mats made up of bacteria including oxygenic photosynthetic Cyanobacteria, anoxygenic phototrophs, and heterotrophs. In modern environments, analogous microbial mats can be found in restricted environments such as carbonate tidal flats and terrestrial hot springs. Here, we report metagenomic sequence data from an analog in the hot springs of Waikite Valley, Aotearoa New Zealand, where carbon-rich, slightly alkaline geothermal waters support diverse phototrophic microbial mats. The Waikite Valley hot spring in the Taupo Volcanic Zone of Aotearoa New Zealand was sampled in duplicate at 8 points along a temperature gradient transect of the outflow, from ~62 °C (near the source) to ~37 °C (~100 meters downstream). Approximately 686 Gb of shotgun metagenomic sequence data was generated by Illumina NovaSeq. Each sample was assembled using SPAdes, followed by binning of metagenome-assembled genomes (MAGs) by MetaBAT. These data are useful for the genomic analysis of novel phototrophic bacteria, as well as for ecological comparisons between thermophilic communities with varying temperatures but otherwise similar conditions.

Subjects: Genomics, Populations and Evolution

Publish: 2024-12-02 15:59:16 UTC


#4 pasta: Pattern Analysis for Spatial Omics Data

Authors: Martin Emons, Samuel Gunz, Helena L. Crowell, Izaskun Mallona, Reinhard Furrer, Mark D. Robinson

Spatial omics assays allow for the molecular characterisation of cells in their spatial context. Notably, the two main technological streams, imaging-based and high-throughput sequencing-based, can give rise to very different data modalities. The characteristics of the two data types are well known in adjacent fields such as spatial statistics, where they correspond to point patterns and lattice data, and a wide range of tools is available. This paper discusses the application of spatial statistics to spatially resolved omics data and, in particular, its advantages, challenges, and nuances. This work is accompanied by a vignette, pasta, that showcases the usefulness of spatial statistics in biology using several R packages.

Subjects: Quantitative Methods, Genomics

Publish: 2024-12-02 14:50:13 UTC


#5 New Graphs at the braingraph.org Website for Studying the Aging Brain Circuitry

Authors: Balint Varga, Vince Grolmusz

Human braingraphs, or connectomes, have been widely studied in the last decade to understand the structural and functional properties of our brain. In the last several years our research group has computed and deposited thousands of human braingraphs at the braingraph.org site, by applying public structural (diffusion) MRI data from young and healthy subjects. Here we describe a recent addition to the braingraph.org site, which contains connectomes from healthy and demented subjects between 42 and 95 years of age, based on the public release of the OASIS-3 dataset. The diffusion MRI data was processed with the Connectome Mapper Toolkit v.3.1. We believe that the new addition to the braingraph.org site will become a useful resource for elucidating the aging circuitry of the human brain in healthy and diseased subjects, including those with Alzheimer's disease in several stages.

Subject: Neurons and Cognition

Publish: 2024-12-02 11:59:32 UTC


#6 The influence of chromosomal inversions on genetic variation and clinal patterns in genomic data of Drosophila melanogaster

Author: Martin Kapun

Chromosomal inversions are structural mutations resulting in the reversal of the gene order along the corresponding genomic region. Due to their influence on recombination patterns, they can have a major influence on genetic variation and the evolutionary process. Accordingly, inversions can act as supergenes that keep together co-adapted gene complexes that form the genetic basis of many complex phenotypes in diverse organisms. In this book chapter, I will present an analysis pipeline to investigate the influence of two common cosmopolitan inversions, In(2L)t and In(3R)Payne, on genome-wide genetic variation and differentiation in worldwide populations of the vinegar fly Drosophila melanogaster. We will use single-individual and pooled resequencing data in combination with population genomics analysis tools to explore the impact of these two inversions on genetic variation, population structure, and clinal variation in natural populations.

Subjects: Populations and Evolution, Genomics

Publish: 2024-12-02 10:29:52 UTC


#7 Dynamic Indicators of Adherence and Retention in Digital Health Studies: Insights from the Brighten Study

Authors: Dylan Hamitouche, Youcef Barkat, Deven Parekh, Eva Hammer, David Benrimoh

Background: Effective use of mobile health technologies requires high participant adherence and retention. However, remote digital health studies often face high attrition and low adherence, potentially introducing bias and limiting generalizability. Objective: This study aims to identify longitudinal indicators of participant retention and adherence to develop strategies for improving data collection in digital health studies and understanding how cohorts are shaped by participant withdrawal and non-adherence. Methods: We conducted analyses on the Brighten study, a smartphone-based randomized controlled trial evaluating apps for depression treatment. Participants were asked to complete seven digital questionnaires regularly. Outcomes included adherence (questionnaire completion), engagement (post-baseline participation), and retention (continued participation over time). We analyzed relationships between these outcomes, static factors (e.g., demographics, average questionnaire scores) and dynamic factors (e.g., questionnaire score changes over time). Results: Of 2,201 participants, 1,093 completed at least one non-baseline questionnaire (median completion rate: 37.6%). Adherence was higher among participants with lower average depression severity (P<.001) and those perceiving improvement (P=.001). Demographic factors significantly influenced adherence and engagement. Participants with greater baseline depressive symptoms were more likely to withdraw before completing non-baseline questionnaires (t=-2.53, P=.01). However, symptom improvement was linked to better adherence (U=127,084; P<.001) and retention (HR=0.78, P=.002). Conclusion: Clinical trajectories and perceived improvements in depressive symptoms are key indicators of engagement, adherence, and retention. These findings may enhance data interpretation and inform strategies to boost retention and adherence in future trials.

Subject: Quantitative Methods

Publish: 2024-12-01 19:27:40 UTC


#8 Mapping, modeling, and reprogramming cell-fate decision making systems

Authors: Lucy Ham, Taylor E. Woodford, Megan A. Coomer, Michael P. H. Stumpf

Many cellular processes involve information processing and decision making. We can probe these processes at increasing molecular detail. The analysis of heterogeneous data remains a challenge that requires new ways of thinking about cells in quantitative, predictive, and mechanistic ways. We discuss the role of mathematical models in the context of cell-fate decision-making systems across the tree of life. Complex multicellular organisms have been a particular focus, but single-celled organisms also have to sense and respond to their environment. We center our discussion around the idea of design principles, which we can learn from observations and modeling and exploit in order to (re)design or guide cellular behavior.

Subject: Cell Behavior

Publish: 2024-12-01 04:16:50 UTC


#9 LLaMA-Gene: A General-purpose Gene Task Large Language Model Based on Instruction Fine-tuning

Author: Wang Liang

Building a general-purpose task model similar to ChatGPT has been an important research direction for gene large language models. Instruction fine-tuning is a key component in building ChatGPT, but existing instructions are primarily based on natural language. Natural language and gene sequences have significant differences in tokenization and encoding. Therefore, constructing a multilingual model that can handle both natural language and gene sequences is crucial for solving this problem. In this paper, we expand the capabilities of the LLaMA large language model to include gene language. This involves expanding the vocabulary using the Byte Pair Encoding (BPE) method, specifically tailored for DNA and protein sequences, and conducting further pre-training on these sequences. We then convert various downstream gene task data into a unified format for instruction fine-tuning and further fine-tune the model on this data. Our study demonstrates that a mixed model of gene and natural language, fine-tuned with instructions, achieves results comparable to the current state-of-the-art (SOTA) in tasks such as gene classification and gene sequence interaction. This provides a promising direction for building a unified large language model for gene tasks.

Subject: Genomics

Publish: 2024-11-30 13:10:39 UTC
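
Illustrative note for entry #9 (not the authors' code): the abstract mentions extending the vocabulary with Byte Pair Encoding on DNA and protein sequences. The sketch below shows the basic BPE idea on toy DNA strings: start from single characters and repeatedly merge the most frequent adjacent pair. The function names, the tie-breaking rule, and the toy sequences are assumptions for illustration; how the paper's tokenizer is actually trained and integrated with LLaMA is not described in the abstract.

```python
from collections import Counter

def apply_merge(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with the joined token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merge rules from a list of sequences."""
    corpus = [list(seq) for seq in sequences]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Most frequent pair; ties broken lexicographically for determinism.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges

def tokenize(seq, merges):
    """Tokenize a sequence by replaying the learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens

merges = learn_bpe_merges(["ATATGC", "ATATAT"], num_merges=2)
print(merges)
print(tokenize("ATATG", merges))
```

Because every token is a concatenation of the characters it replaced, joining the tokens always reproduces the input sequence.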


#10 Stochastic Dynamics and Probability Analysis for a Generalized Epidemic Model with Environmental Noise

Authors: Brahim Boukanjimea, Mohamed Maama

In this paper we consider a stochastic SEIQR (susceptible-exposed-infected-quarantined-recovered) epidemic model with a generalized incidence function. Using the Lyapunov method, we establish the existence and uniqueness of a global positive solution to the model, ensuring that it remains well-defined over time. Through the application of Young's inequality and Chebyshev's inequality, we establish stochastic ultimate boundedness and stochastic permanence, providing insights into the long-term behavior of the epidemic dynamics under random perturbations. Furthermore, we derive conditions for stochastic extinction, which describe scenarios where the epidemic may eventually die out, and V-geometric ergodicity, which indicates the rate at which the system's state converges to its equilibrium. Finally, we perform numerical simulations to verify our theoretical results and assess the model's behavior under different parameters.

Subjects: Populations and Evolution, Dynamical Systems, Data Analysis, Statistics and Probability, Neurons and Cognition

Publish: 2024-11-30 09:10:47 UTC
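
Illustrative note for entry #10 (not the authors' model or code): a stochastic SEIQR system of the kind described can be simulated with the Euler-Maruyama scheme. The abstract does not give the equations, so everything specific here is an assumption: a bilinear incidence term `beta * S * I` stands in for the generalized incidence function, the noise is multiplicative (each compartment perturbed in proportion to its size), and all parameter names and values are hypothetical. Clamping at zero is a numerical convenience, since the discrete scheme does not inherit the positivity that the paper proves for the continuous model.

```python
import math
import random

def simulate_seiqr(state, params, sigma, dt=0.01, steps=5000, seed=0):
    """Euler-Maruyama path for an assumed stochastic SEIQR model.

    state  = (S, E, I, Q, R) initial values
    params = (lam, beta, mu, kappa, delta, gamma, eps, alpha), all hypothetical:
             recruitment, transmission, natural death, E->I progression,
             I->Q quarantine, I->R recovery, Q->R recovery, disease death
    sigma  = five noise intensities, one per compartment
    """
    rng = random.Random(seed)
    lam, beta, mu, kappa, delta, gamma, eps, alpha = params
    s, e, i, q, r = state
    sqrt_dt = math.sqrt(dt)
    path = [(s, e, i, q, r)]
    for _ in range(steps):
        drift = (
            lam - beta * s * i - mu * s,
            beta * s * i - (mu + kappa) * e,
            kappa * e - (mu + delta + gamma + alpha) * i,
            delta * i - (mu + eps + alpha) * q,
            gamma * i + eps * q - mu * r,
        )
        nxt = []
        for x, f, sg in zip((s, e, i, q, r), drift, sigma):
            dw = rng.gauss(0.0, 1.0) * sqrt_dt  # Brownian increment over dt
            nxt.append(max(x + f * dt + sg * x * dw, 0.0))  # clamp at zero
        s, e, i, q, r = nxt
        path.append((s, e, i, q, r))
    return path

params = (0.02, 0.5, 0.02, 0.2, 0.1, 0.1, 0.1, 0.0)
noisy = simulate_seiqr((0.9, 0.05, 0.05, 0.0, 0.0), params, sigma=(0.05,) * 5)
print("final state:", tuple(round(x, 4) for x in noisy[-1]))
```

Setting all noise intensities to zero recovers the deterministic Euler scheme, which is a convenient sanity check before exploring the effect of noise on extinction.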


#11 How reproducible are data-driven subtypes of Alzheimer's disease atrophy?

Authors: Emma Prevot, Cameron Shand, Neil Oxtoby, for the Alzheimer's Disease Neuroimaging Initiative

Alzheimer's disease (AD) exhibits substantial clinical and biological heterogeneity, complicating efforts in treatment and intervention development. While new computational methods offer insights into AD progression, the reproducibility of the subtypes they identify across datasets remains understudied, particularly concerning the robustness of subtype definitions when validated on diverse databases. This study evaluates the consistency of AD progression subtypes identified by the Subtype and Stage Inference (SuStaIn) algorithm using T1-weighted MRI data across 5,444 subjects from the ANMerge, OASIS, and ADNI datasets, forming four independent cohorts. Each cohort was analyzed under two conditions: one using the full cohort, including cognitively normal controls, and another excluding controls to test subtype robustness. Results confirm the three primary atrophy subtypes identified in earlier studies (Typical, Cortical, and Subcortical), as well as the emergence of rare and atypical AD variants such as posterior cortical atrophy (PCA). Notably, each subtype displayed varying robustness to the inclusion of controls, with certain subtypes, like Subcortical, more influenced by cohort composition. This investigation underscores SuStaIn's reliability for defining stable AD subtypes and suggests its utility in clinical stratification for trials and diagnosis. However, our findings also highlight the need for improved dataset diversity, particularly in terms of ethnic representation, to enhance generalizability and support broader clinical application.

Subjects: Quantitative Methods, Applications

Publish: 2024-11-29 11:42:13 UTC


#12 The Copernican Argument for Alien Consciousness; The Mimicry Argument Against Robot Consciousness

Authors: Eric Schwitzgebel, Jeremy Pober

On broadly Copernican grounds, we are entitled to default assume that apparently behaviorally sophisticated extraterrestrial entities ("aliens") would be conscious. Otherwise, we humans would be inexplicably, implausibly lucky to have consciousness, while similarly behaviorally sophisticated entities elsewhere would be mere shells, devoid of consciousness. However, this Copernican default assumption is canceled in the case of behaviorally sophisticated entities designed to mimic superficial features associated with consciousness in humans ("consciousness mimics"), and in particular a broad class of current, near-future, and hypothetical robots. These considerations, which we formulate, respectively, as the Copernican and Mimicry Arguments, jointly defeat an otherwise potentially attractive parity principle, according to which we should apply the same types of behavioral or cognitive tests to aliens and robots, attributing or denying consciousness similarly to the extent they perform similarly. Instead of grounding speculations about alien and robot consciousness in metaphysical or scientific theories about the physical or functional bases of consciousness, our approach appeals directly to the epistemic principles of Copernican mediocrity and inference to the best explanation. This permits us to justify certain default assumptions about consciousness while remaining to a substantial extent neutral about specific metaphysical and scientific theories.

Subject: Neurons and Cognition

Publish: 2024-11-12 17:26:49 UTC


#13 Tokenizing 3D Molecule Structure with Quantized Spherical Coordinates

Authors: Kaiyuan Gao, Yusong Wang, Haoxiang Guan, Zun Wang, Qizhi Pei, John E. Hopcroft, Kun He, Lijun Wu

The application of language models (LMs) to molecular structure generation using line notations such as SMILES and SELFIES is well established in the field of cheminformatics. However, extending these models to generate 3D molecular structures presents significant challenges. Two primary obstacles emerge: (1) the difficulty in designing a 3D line notation that ensures SE(3)-invariant atomic coordinates, and (2) the non-trivial task of tokenizing continuous coordinates for use in LMs, which inherently require discrete inputs. To address these challenges, we propose Mol-StrucTok, a novel method for tokenizing 3D molecular structures. Our approach comprises two key innovations: (1) We design a line notation for 3D molecules by extracting local atomic coordinates in a spherical coordinate system. This notation builds upon existing 2D line notations and remains agnostic to their specific forms, ensuring compatibility with various molecular representation schemes. (2) We employ a Vector Quantized Variational Autoencoder (VQ-VAE) to tokenize these coordinates, treating them as generation descriptors. To further enhance the representation, we incorporate neighborhood bond lengths and bond angles as understanding descriptors. Leveraging this tokenization framework, we train a GPT-2 style model for 3D molecular generation tasks. Results demonstrate strong performance with significantly faster generation speeds and competitive chemical stability compared to previous methods. Further, by integrating our learned discrete representations into the Graphormer model for property prediction on the QM9 dataset, Mol-StrucTok reveals consistent improvements across various molecular properties, underscoring the versatility and robustness of our approach.

Subjects: Machine Learning, Biomolecules

Publish: 2024-12-02 14:50:44 UTC
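
Illustrative note for entry #13 (not Mol-StrucTok itself): the abstract describes turning continuous spherical coordinates into discrete tokens. The paper does this with a learned VQ-VAE codebook on local coordinate frames; the sketch below swaps in plain uniform binning in a single global frame, only to show what "continuous coordinate in, integer token out" looks like. The bin count, the maximum radius `r_max`, and all function names are assumptions.

```python
import math

def to_spherical(x, y, z):
    """Cartesian -> (r, polar angle theta in [0, pi], azimuth phi in [-pi, pi])."""
    r = math.sqrt(x * x + y * y + z * z)
    cos_theta = max(-1.0, min(1.0, z / r)) if r > 0 else 1.0
    return r, math.acos(cos_theta), math.atan2(y, x)

def quantize(value, lo, hi, bins):
    """Map a value in [lo, hi] to an integer bin index in [0, bins - 1]."""
    clipped = min(max(value, lo), hi)
    return min(int((clipped - lo) / (hi - lo) * bins), bins - 1)

def dequantize(index, lo, hi, bins):
    """Return the centre of the bin with the given index."""
    return lo + (index + 0.5) * (hi - lo) / bins

def tokenize_point(x, y, z, r_max=4.0, bins=256):
    """Three integer tokens for one displacement vector (uniform bins, assumed)."""
    r, theta, phi = to_spherical(x, y, z)
    return (
        quantize(r, 0.0, r_max, bins),
        quantize(theta, 0.0, math.pi, bins),
        quantize(phi, -math.pi, math.pi, bins),
    )

def detokenize_point(tokens, r_max=4.0, bins=256):
    """Approximate Cartesian coordinates recovered from the three tokens."""
    r = dequantize(tokens[0], 0.0, r_max, bins)
    theta = dequantize(tokens[1], 0.0, math.pi, bins)
    phi = dequantize(tokens[2], -math.pi, math.pi, bins)
    return (
        r * math.sin(theta) * math.cos(phi),
        r * math.sin(theta) * math.sin(phi),
        r * math.cos(theta),
    )

point = (1.0, 0.5, -0.3)
tokens = tokenize_point(*point)
recovered = detokenize_point(tokens)
error = math.dist(point, recovered)
print(tokens, f"reconstruction error = {error:.4f}")
```

The reconstruction error is bounded by the bin widths, which is the basic trade-off any coordinate tokenizer faces: more bins give finer geometry but a larger vocabulary.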


#14 CellSeg1: Robust Cell Segmentation with One Training Image

Authors: Peilin Zhou, Bo Du, Yongchao Xu

Recent trends in cell segmentation have shifted towards universal models to handle diverse cell morphologies and imaging modalities. However, for continuously emerging cell types and imaging techniques, these models still require hundreds or thousands of annotated cells for fine-tuning. We introduce CellSeg1, a practical solution for segmenting cells of arbitrary morphology and modality with a few dozen cell annotations in 1 image. By adopting Low-Rank Adaptation of the Segment Anything Model (SAM), we achieve robust cell segmentation. Tested on 19 diverse cell datasets, CellSeg1 trained on 1 image achieved 0.81 average mAP at 0.5 IoU, performing comparably to existing models trained on over 500 images. It also demonstrated superior generalization in cross-dataset tests on TissueNet. We found that high-quality annotation of a few dozen densely packed cells of varied sizes is key to effective segmentation. CellSeg1 provides an efficient solution for cell segmentation with minimal annotation effort.

Subjects: Computer Vision and Pattern Recognition, Quantitative Methods

Publish: 2024-12-02 11:55:22 UTC
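
Illustrative note for entry #14 (not the CellSeg1 evaluation code): the reported metric is evaluated "at 0.5 IoU", meaning a predicted cell counts as correct only if its intersection-over-union with a ground-truth cell reaches 0.5. The sketch below shows that matching step on masks represented as sets of pixel coordinates. The representation, the greedy one-to-one matching, and the function names are assumptions; how the paper aggregates matches into mAP is not stated in the abstract.

```python
def mask_iou(a, b):
    """Intersection-over-union of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def match_instances(pred_masks, true_masks, iou_threshold=0.5):
    """Count true positives, false positives, and false negatives.

    Each ground-truth mask can be matched by at most one prediction.
    """
    unmatched = set(range(len(true_masks)))
    tp = 0
    for pred in pred_masks:
        best_j, best_iou = None, 0.0
        for j in unmatched:
            iou = mask_iou(pred, true_masks[j])
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None and best_iou >= iou_threshold:
            unmatched.remove(best_j)
            tp += 1
    fp = len(pred_masks) - tp  # predictions with no sufficiently overlapping cell
    fn = len(unmatched)        # ground-truth cells nobody matched
    return tp, fp, fn

true_masks = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6)}]
pred_masks = [{(0, 0), (0, 1), (1, 0)}, {(9, 9)}]
print(match_instances(pred_masks, true_masks))
```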


#15 SUICA: Learning Super-high Dimensional Sparse Implicit Neural Representations for Spatial Transcriptomics

Authors: Qingtian Zhu, Yumin Zheng, Yuling Sang, Yifan Zhan, Ziyan Zhu, Jun Ding, Yinqiang Zheng

Spatial Transcriptomics (ST) is a method that captures spatial gene expression profiles within histological sections. The discrete spatial distribution and the super-high dimensional sequencing results make ST data challenging to model effectively. In this paper, we model ST in a continuous and compact manner with the proposed tool, SUICA, empowered by the great approximation capability of Implicit Neural Representations (INRs), which can improve both the spatial resolution and the gene expression. Concretely, within the proposed SUICA, we incorporate a graph-augmented Autoencoder to effectively model the context information of the unstructured spots and provide informative embeddings that are structure-aware for spatial mapping. We also tackle the extremely skewed distribution in a regression-by-classification fashion and enforce classification-based loss functions for the optimization of SUICA. In extensive experiments on a wide range of common ST platforms, SUICA outperforms both conventional INR variants and SOTA methods for ST super-resolution regarding numerical fidelity, statistical correlation, and bio-conservation. The prediction by SUICA also showcases amplified gene signatures that enrich the bio-conservation of the raw data and benefit subsequent analysis. The code is available at https://github.com/Szym29/SUICA.

Subjects: Machine Learning, Genomics

Publish: 2024-12-02 05:02:18 UTC


#16 Simplified derivations for high-dimensional convex learning problems

Authors: David G. Clark, Haim Sompolinsky

Statistical physics provides tools for analyzing high-dimensional problems in machine learning and theoretical neuroscience. These calculations, particularly those using the replica method, often involve lengthy derivations that can obscure physical interpretation. We give concise, non-replica derivations of several key results and highlight their underlying similarities. Specifically, we introduce a cavity approach to analyzing high-dimensional learning problems and apply it to three cases: perceptron classification of points, perceptron classification of manifolds, and kernel ridge regression. These problems share a common structure -- a bipartite system of interacting feature and datum variables -- enabling a unified analysis. For perceptron-capacity problems, we identify a symmetry that allows derivation of correct capacities through a naïve method. These results match those obtained through the replica method.

Subjects: Disordered Systems and Neural Networks, Neural and Evolutionary Computing, Neurons and Cognition

Publish: 2024-12-02 04:32:14 UTC


#17 Multi-Scale Representation Learning for Protein Fitness Prediction

Authors: Zuobai Zhang, Pascal Notin, Yining Huang, Aurélie Lozano, Vijil Chenthamarakshan, Debora Marks, Payel Das, Jian Tang

Designing novel functional proteins crucially depends on accurately modeling their fitness landscape. Given the limited availability of functional annotations from wet-lab experiments, previous methods have primarily relied on self-supervised models trained on vast, unlabeled protein sequence or structure datasets. While initial protein representation learning studies solely focused on either sequence or structural features, recent hybrid architectures have sought to merge these modalities to harness their respective strengths. However, these sequence-structure models have so far achieved only incremental improvements when compared to the leading sequence-only approaches, highlighting unresolved challenges in effectively leveraging these modalities together. Moreover, the function of certain proteins is highly dependent on the granular aspects of their surface topology, which have been overlooked by prior models. To address these limitations, we introduce the Sequence-Structure-Surface Fitness (S3F) model, a novel multimodal representation learning framework that integrates protein features across several scales. Our approach combines sequence representations from a protein language model with Geometric Vector Perceptron networks encoding protein backbone and detailed surface topology. The proposed method achieves state-of-the-art fitness prediction on the ProteinGym benchmark encompassing 217 substitution deep mutational scanning assays, and provides insights into the determinants of protein function. Our code is at https://github.com/DeepGraphLearning/S3F.

Subjects: Machine Learning, Biomolecules

Publish: 2024-12-02 04:28:10 UTC


#18 Toric Multivariate Gaussian Models from Symmetries in a Tree

Authors: Emma Cardwell, Aida Maraj, Alvaro Ribot

Given a rooted tree $T$ on $n$ non-root leaves with colored and zeroed nodes, we construct a linear space $L_T$ of $n\times n$ symmetric matrices with constraints determined by the combinatorics of the tree. When $L_T$ represents the covariance matrices of a Gaussian model, it provides natural generalizations of Brownian motion tree (BMT) models in phylogenetics. When $L_T$ represents a space of concentration matrices of a Gaussian model, it gives certain colored Gaussian graphical models, which we refer to as BMT derived models. We investigate conditions under which the reciprocal variety $L_T^{-1}$ is toric. Relying on the birational isomorphism of the inverse matrix map, we show that if the BMT derived graph of $T$ is vertex-regular and a block graph, under the derived Laplacian transformation, $L_T^{-1}$ is the vanishing locus of a toric ideal. This ideal is given by the sum of the toric ideal of the Gaussian graphical model on the block graph, the toric ideal of the original BMT model, and binomial linear conditions coming from vertex-regularity. To this end, we provide monomial parametrizations for these toric models realized through paths among leaves in $T$.

Subjects: Algebraic Geometry, Combinatorics, Statistics Theory, Populations and Evolution

Publish: 2024-12-01 17:12:24 UTC


#19 Generative Model for Synthesizing Ionizable Lipids: A Monte Carlo Tree Search Approach

Authors: Jingyi Zhao, Yuxuan Ou, Austin Tripp, Morteza Rasoulianboroujeni, José Miguel Hernández-Lobato

Ionizable lipids are essential in developing lipid nanoparticles (LNPs) for effective messenger RNA (mRNA) delivery. While traditional methods for designing new ionizable lipids are typically time-consuming, deep generative models have emerged as a powerful solution, significantly accelerating the molecular discovery process. However, a practical challenge arises as the molecular structures generated can often be difficult or infeasible to synthesize. This project explores Monte Carlo tree search (MCTS)-based generative models for synthesizable ionizable lipids. Leveraging a synthetically accessible lipid building block dataset and two specialized predictors to guide the search through chemical space, we introduce a policy-network-guided MCTS generative model capable of producing new ionizable lipids with available synthesis pathways.

Subjects: Machine Learning, Artificial Intelligence, Biomolecules, Quantitative Methods

Publish: 2024-12-01 13:34:22 UTC


#20 The ecological forecast horizon revisited: Potential, actual and relative system predictability

Authors: Marieke Wesselkamp, Jakob Albrecht, Ewan Pinnington, William J. Castillo, Florian Pappenberger, Carsten F. Dormann

Ecological forecasts are model-based statements about currently unknown ecosystem states in time or space. For a model forecast to usefully inform decision-makers, its adequacy must be established through model validation and verification. The measure of forecast goodness that can be translated into a limit up to which a forecast is acceptable is known as the 'forecast horizon'. While verification of meteorological models follows strict criteria with established metrics and forecast horizons, assessments of ecological forecasting models remain experiment-specific and forecast horizons are rarely reported. As such, users of ecological forecasts remain uninformed of how far into the future statements can be trusted. In this work, we synthesise existing approaches, define empirical forecast horizons in a unified framework for assessing ecological predictability and offer recipes on their computation. We distinguish upper and lower boundary estimates of predictability limits, reflecting the model's potential and actual forecast horizon, and show how a benchmark model can help determine its relative forecast horizon. The approaches are demonstrated with four case studies from population, ecosystem, and earth system research.

Subjects: Applications, Data Analysis, Statistics and Probability, Populations and Evolution, Methodology

Publish: 2024-12-01 10:14:42 UTC
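
Illustrative note for entry #20 (not the authors' recipes): a forecast horizon can be read off a skill-versus-lead-time curve as the last lead time before skill first crosses an acceptability threshold, and a relative horizon as the last lead time before the model stops beating a benchmark. The sketch below encodes those two readings. The function names, the "first crossing" convention, and the toy numbers are assumptions; the paper's exact definitions of potential, actual, and relative horizons may differ.

```python
def forecast_horizon(skill_by_lead, threshold, higher_is_better=True):
    """Last lead time (1-based) before skill first fails the threshold.

    Returns 0 if the very first lead time already fails.
    """
    horizon = 0
    for lead, skill in enumerate(skill_by_lead, start=1):
        ok = skill >= threshold if higher_is_better else skill <= threshold
        if not ok:
            break
        horizon = lead
    return horizon

def relative_forecast_horizon(model_error, benchmark_error):
    """Last lead time (1-based) before the model stops beating the benchmark.

    Both inputs are error curves (lower is better) of equal length.
    """
    horizon = 0
    for lead, (m, b) in enumerate(zip(model_error, benchmark_error), start=1):
        if m >= b:
            break
        horizon = lead
    return horizon

# Toy curves: correlation skill per lead time, and errors against a
# constant-error benchmark such as a climatological mean (hypothetical values).
skill = [0.9, 0.8, 0.6, 0.4, 0.7]
print(forecast_horizon(skill, threshold=0.5))
print(relative_forecast_horizon([0.1, 0.2, 0.4, 0.6], [0.5, 0.5, 0.5, 0.5]))
```

With the first-crossing convention, a later recovery of skill (the 0.7 at lead 5 in the toy curve) does not extend the horizon, which is the conservative choice for users who need a guaranteed trust window.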


#21 Towards Unified Molecule-Enhanced Pathology Image Representation Learning via Integrating Spatial Transcriptomics

Authors: Minghao Han, Dingkang Yang, Jiabei Cheng, Xukun Zhang, Linhao Qu, Zizhi Chen, Lihua Zhang

Recent advances in multimodal pre-training models have significantly improved computational pathology. However, current approaches predominantly rely on visual-language models, which may impose limitations from a molecular perspective and lead to performance bottlenecks. Here, we introduce a Unified Molecule-enhanced Pathology Image REpresentation Learning framework (UMPIRE). UMPIRE aims to leverage complementary information from gene expression profiles to guide the multimodal pre-training, enhancing the molecular awareness of pathology image representation learning. We demonstrate that this molecular perspective provides a robust, task-agnostic training signal for learning pathology image embeddings. Due to the scarcity of paired data, approximately 4 million entries of spatial transcriptomics gene expression were collected to train the gene encoder. By leveraging powerful pre-trained encoders, UMPIRE aligns the encoders across over 697K pathology image-gene expression pairs. The performance of UMPIRE is demonstrated across various molecular-related downstream tasks, including gene expression prediction, spot classification, and mutation state prediction in whole slide images. Our findings highlight the effectiveness of multimodal data integration and open new avenues for exploring computational pathology enhanced by molecular perspectives. The code and pre-trained weights are available at https://github.com/Hanminghao/UMPIRE.

Subjects: Computer Vision and Pattern Recognition, Genomics

Publish: 2024-12-01 03:09:52 UTC


#22 Spatial Clustering of Molecular Localizations with Graph Neural Networks

Authors: Jesús Pineda, Sergi Masó-Orriols, Joan Bertran, Mattias Goksör, Giovanni Volpe, Carlo Manzo

Single-molecule localization microscopy generates point clouds corresponding to fluorophore localizations. Spatial cluster identification and analysis of these point clouds are crucial for extracting insights about molecular organization. However, this task becomes challenging in the presence of localization noise, high point density, or complex biological structures. Here, we introduce MIRO (Multimodal Integration through Relational Optimization), an algorithm that uses recurrent graph neural networks to transform the point clouds in order to improve clustering efficiency when applying conventional clustering techniques. We show that MIRO supports simultaneous processing of clusters of different shapes and at multiple scales, demonstrating improved performance across varied datasets. Our comprehensive evaluation demonstrates MIRO's transformative potential for single-molecule localization applications, showcasing its capability to revolutionize cluster analysis and provide accurate, reliable details of molecular architecture. In addition, MIRO's robust clustering capabilities hold promise for applications in various fields such as neuroscience, for the analysis of neural connectivity patterns, and environmental science, for studying spatial distributions of ecological data.

Subjects: Machine Learning , Biological Physics , Data Analysis, Statistics and Probability , Quantitative Methods

Publish: 2024-11-29 17:43:57 UTC


#23 Deep Neural Network-Based Prediction of B-Cell Epitopes for SARS-CoV and SARS-CoV-2: Enhancing Vaccine Design through Machine Learning

Authors: Xinyu Shi, Yixin Tao, Shih-Chi Lin

The accurate prediction of B-cell epitopes is critical for guiding vaccine development against infectious diseases, including SARS and COVID-19. This study explores the use of a deep neural network (DNN) model to predict B-cell epitopes for SARS-CoV and SARS-CoV-2, leveraging a dataset that incorporates essential protein and peptide features. Traditional sequence-based methods often struggle with large, complex datasets, but deep learning offers promising improvements in predictive accuracy. Our model employs regularization techniques, such as dropout and early stopping, to enhance generalization, while also analyzing key features, including isoelectric point and aromaticity, that influence epitope recognition. Results indicate an overall accuracy of 82% in predicting COVID-19 negative and positive cases, with room for improvement in detecting positive samples. This research demonstrates the applicability of deep learning in epitope mapping, suggesting that such approaches can enhance the speed and precision of vaccine design for emerging pathogens. Future work could incorporate structural data and diverse viral strains to further refine prediction capabilities.

Subjects: Machine Learning , Computational Engineering, Finance, and Science , Biomolecules , Machine Learning

Publish: 2024-11-28 01:54:43 UTC


#24 Differential learning kinetics govern the transition from memorization to generalization during in-context learning

Authors: Alex Nguyen, Gautam Reddy

Transformers exhibit in-context learning (ICL): the ability to use novel information presented in the context without additional weight updates. Recent work shows that ICL emerges when models are trained on a sufficiently diverse set of tasks, and that the transition from memorization to generalization is sharp with increasing task diversity. One interpretation is that a network's limited capacity to memorize favors generalization. Here, we examine the mechanistic underpinnings of this transition using a small transformer applied to a synthetic ICL task. Using theory and experiment, we show that the sub-circuits that memorize and generalize can be viewed as largely independent. The relative rates at which these sub-circuits learn, rather than capacity constraints, explain the transition from memorization to generalization. We uncover a memorization scaling law, which determines the task diversity threshold at which the network generalizes. The theory quantitatively explains a variety of other ICL-related phenomena, including the long-tailed distribution of when ICL is acquired, the bimodal behavior of solutions close to the task diversity threshold, the influence of contextual and data distributional statistics on ICL, and the transient nature of ICL.

Subjects: Machine Learning , Disordered Systems and Neural Networks , Artificial Intelligence , Neural and Evolutionary Computing , Neurons and Cognition

Publish: 2024-11-27 22:12:29 UTC