2024-12-03 | | Total: 24
Neurons consume energy from ATP hydrolysis to maintain a far-from-equilibrium steady state inside the cell, so all physiological functions inside the cell are modulated by thermodynamics. Neurons that encode, transfer, and process information with high energy consumption display a phenomenon called slow afterhyperpolarization (sAHP) after burst firing, whose properties are affected by the energy conditions. Here we constructed a thermodynamic model to quantitatively describe the sAHP process generated by $Na^+-K^+$ ATPases (NKA) and calcium-activated potassium (K(Ca)) channels. The model simulates how the amplitude of sAHP is affected by the intracellular ATP concentration and the ATP hydrolysis free energy $\Delta G$. The results show a trade-off between the modulation by NKA and by K(Ca) channels of the sAHP's energy dependence, and also predict an alteration of sAHP behavior under insufficient ATP supply if the ratio of NKA to K(Ca) expression levels is changed. The research provides insight into the maintenance of neural homeostasis and supports further research on metabolism-related and neurodegenerative diseases.
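As a rough illustration of the energy quantity mentioned above, the following sketch evaluates the textbook relation $\Delta G = \Delta G^{0'} + RT \ln([ADP][P_i]/[ATP])$. It is not taken from the paper: the standard free energy of about -30.5 kJ/mol, the temperature, the concentrations, and the function name `atp_hydrolysis_dg` are all assumptions chosen for illustration.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol*K)


def atp_hydrolysis_dg(atp, adp, pi, dg0=-30.5, temp_k=310.15):
    """Free energy of ATP hydrolysis in kJ/mol.

    Concentrations are in mol/L (relative to a 1 M standard state).
    dg0 is an assumed textbook standard value, not a value from the paper.
    """
    return dg0 + R_KJ * temp_k * math.log(adp * pi / atp)


# Illustrative (assumed) concentrations: 5 mM ATP, 0.5 mM ADP, 5 mM phosphate.
dg_normal = atp_hydrolysis_dg(atp=5e-3, adp=0.5e-3, pi=5e-3)
# Same ADP and phosphate, but ATP depleted to 1 mM.
dg_depleted = atp_hydrolysis_dg(atp=1e-3, adp=0.5e-3, pi=5e-3)
print(round(dg_normal, 1), round(dg_depleted, 1))
```

Under these assumed values, depleting ATP makes $\Delta G$ less negative, which is the kind of energy condition the abstract describes as affecting sAHP.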
Synaptic plasticity dynamically shapes the connectivity of neural systems and is key to learning processes in the brain. To what extent the mechanisms of plasticity can be exploited to drive a neural network and make it perform some kind of computational task remains unclear. This question, relevant in a bioengineering context, can be formulated as a control problem on a high-dimensional system with strongly constrained and non-linear dynamics. We present a self-contained procedure which, through appropriate spatio-temporal stimulations of the neurons, is able to drive rate-based neural networks with arbitrary initial connectivity towards a desired functional state. We illustrate our approach on two different computational tasks: a non-linear association between multiple input stimulations and activity patterns (representing digit images), and the construction of a continuous attractor encoding a collective variable in a neural population. Our work thus provides a proof of principle for emerging paradigms of in vitro computation based on real neurons.
The rise of complex multicellular ecosystems in Neoproterozoic time was preceded by a microbial Proterozoic biosphere, where productivity may have been largely restricted to microbial mats made up of bacteria including oxygenic photosynthetic Cyanobacteria, anoxygenic phototrophs, and heterotrophs. In modern environments, analogous microbial mats can be found in restricted environments such as carbonate tidal flats and terrestrial hot springs. Here, we report metagenomic sequence data from an analog in the hot springs of Waikite Valley, Aotearoa New Zealand, where carbon-rich, slightly alkaline geothermal waters support diverse phototrophic microbial mats. The Waikite Valley hot spring in the Taupo Volcanic Zone of Aotearoa New Zealand was sampled in duplicate at 8 points along a temperature gradient transect of the outflow, from ~62 °C (near the source) to ~37 °C (~100 meters downstream). ~686 Gb of shotgun metagenomic sequence data was generated by Illumina NovaSeq. Each sample was assembled using SPAdes, followed by binning of metagenome-assembled genomes (MAGs) by MetaBAT. These data are useful for the genomic analysis of novel phototrophic bacteria, as well as for ecological comparisons between thermophilic communities with varying temperatures but otherwise similar conditions.
Spatial omics assays allow for the molecular characterisation of cells in their spatial context. Notably, the two main technological streams, imaging-based and high-throughput sequencing-based, can give rise to very different data modalities. The characteristics of the two data types are well known in adjacent fields such as spatial statistics as point patterns and lattice data, and there is a wide range of tools available. This paper discusses the application of spatial statistics to spatially-resolved omics data and in particular, discusses various advantages, challenges, and nuances. This work is accompanied by a vignette, pasta, that showcases the usefulness of spatial statistics in biology using several R packages.
Human braingraphs, or connectomes, have been widely studied over the last decade to understand the structural and functional properties of our brain. In the last several years our research group has computed and deposited thousands of human braingraphs to the braingraph.org site, by applying public structural (diffusion) MRI data from young and healthy subjects. Here we describe a recent addition to the braingraph.org site, which contains connectomes from healthy and demented subjects between 42 and 95 years of age, based on the public release of the OASIS-3 dataset. The diffusion MRI data was processed with the Connectome Mapper Toolkit v.3.1. We believe that the new addition to the braingraph.org site will become a useful resource for elucidating the aging circuitry of the human brain in healthy and diseased subjects, including those with Alzheimer's disease in several stages.
Chromosomal inversions are structural mutations resulting in the reversal of the gene order along the corresponding genomic region. Due to their influence on recombination patterns, they can have a major influence on genetic variation and the evolutionary process. Accordingly, inversions can act as supergenes that keep together co-adapted gene complexes that form the genetic basis of many complex phenotypes in diverse organisms. In this book chapter, I will present an analysis pipeline to investigate the influence of two common cosmopolitan inversions, In(2L)t and In(3R)Payne, on genome-wide genetic variation and differentiation in world-wide populations of the vinegar fly Drosophila melanogaster. We will use single-individual and pooled resequencing data in combination with population genomics analysis tools to explore the impact of these two inversions on genetic variation, population structure, and clinal variation in natural populations.
Background: Effective use of mobile health technologies requires high participant adherence and retention. However, remote digital health studies often face high attrition and low adherence, potentially introducing bias and limiting generalizability. Objective: This study aims to identify longitudinal indicators of participant retention and adherence to develop strategies for improving data collection in digital health studies and understanding how cohorts are shaped by participant withdrawal and non-adherence. Methods: We conducted analyses on the Brighten study, a smartphone-based randomized controlled trial evaluating apps for depression treatment. Participants were asked to complete seven digital questionnaires regularly. Outcomes included adherence (questionnaire completion), engagement (post-baseline participation), and retention (continued participation over time). We analyzed relationships between these outcomes, static factors (e.g., demographics, average questionnaire scores) and dynamic factors (e.g., questionnaire score changes over time). Results: Of 2,201 participants, 1,093 completed at least one non-baseline questionnaire (median completion rate: 37.6%). Adherence was higher among participants with lower average depression severity (P<.001) and those perceiving improvement (P=.001). Demographic factors significantly influenced adherence and engagement. Participants with greater baseline depressive symptoms were more likely to withdraw before completing non-baseline questionnaires (t=-2.53, P=.01). However, symptom improvement was linked to better adherence (U=127,084; P<.001) and retention (HR=0.78, P=.002). Conclusion: Clinical trajectories and perceived improvements in depressive symptoms are key indicators of engagement, adherence, and retention. These findings may enhance data interpretation and inform strategies to boost retention and adherence in future trials.
Many cellular processes involve information processing and decision making. We can probe these processes at increasing molecular detail. The analysis of heterogeneous data remains a challenge that requires new ways of thinking about cells in quantitative, predictive, and mechanistic ways. We discuss the role of mathematical models in the context of cell-fate decision making systems across the tree of life. Complex multi-cellular organisms have been a particular focus, but single celled organisms also have to sense and respond to their environment. We center our discussion around the idea of design principles which we can learn from observations and modeling, and exploit in order to (re)-design or guide cellular behavior.
Building a general-purpose task model similar to ChatGPT has been an important research direction for gene large language models. Instruction fine-tuning is a key component in building ChatGPT, but existing instructions are primarily based on natural language. Natural language and gene sequences have significant differences in tokenization and encoding. Therefore, constructing a multilingual model that can handle both natural language and gene sequences is crucial for solving this problem. In this paper, we expand the capabilities of the LLaMA large language model to include gene language. This involves expanding the vocabulary using the Byte Pair Encoding (BPE) method, specifically tailored for DNA and protein sequences, and conducting further pre-training on these sequences. We then convert various downstream gene task data into a unified format for instruction fine-tuning and further fine-tune the model on this data. Our study demonstrates that a mixed model of gene and natural language, fine-tuned with instructions, achieves results comparable to the current state-of-the-art (SOTA) in tasks such as gene classification and gene sequence interaction. This provides a promising direction for building a unified large language model for gene tasks.
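To make the vocabulary-expansion step more concrete, here is a minimal sketch of how BPE merges could be learned on DNA strings. It is a generic character-level BPE, not the authors' tokenizer; the function names `learn_bpe` and `merge_pair`, the toy sequences, and the tie-breaking rule are assumptions for illustration.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` BPE merges, starting from single characters."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties broken lexicographically (an assumption)
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges, corpus


toy_dna = ["ATGATGCC", "ATGCCATG"]  # hypothetical toy sequences
merges, tokenized = learn_bpe(toy_dna, num_merges=2)
print(merges)
print(tokenized)
```

The learned merged tokens (here short motifs such as `ATG`) are the kind of entries that would be appended to an existing natural-language vocabulary before further pre-training.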
In this paper we consider a stochastic SEIQR (susceptible-exposed-infected-quarantined-recovered) epidemic model with a generalized incidence function. Using the Lyapunov method, we establish the existence and uniqueness of a global positive solution to the model, ensuring that it remains well-defined over time. Through the application of Young's inequality and Chebyshev's inequality, we demonstrate the concepts of stochastic ultimate boundedness and stochastic permanence, providing insights into the long-term behavior of the epidemic dynamics under random perturbations. Furthermore, we derive conditions for stochastic extinction, which describe scenarios where the epidemic may eventually die out, and V-geometric ergodicity, which indicates the rate at which the system's state converges to its equilibrium. Finally, we perform numerical simulations to verify our theoretical results and assess the model's behavior under different parameters.
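The kind of numerical simulation mentioned above can be sketched with an Euler-Maruyama scheme. The abstract does not give the model equations, so everything below is an assumption for illustration: a bilinear incidence term `beta * S * I` stands in for the generalized incidence function, noise is taken proportional to each compartment, and the parameter names and values are hypothetical.

```python
import math
import random


def simulate_seiqr(state, params, sigma, dt, steps, seed=0):
    """Euler-Maruyama path of an assumed stochastic SEIQR model.

    state  = (S, E, I, Q, R) initial values
    params = (lam, beta, mu, kappa, gamma, delta, eps), all hypothetical:
             recruitment, transmission, death, incubation, recovery,
             quarantine and release rates.
    Note: this scheme does not guarantee positivity for large sigma or dt.
    """
    rng = random.Random(seed)
    lam, beta, mu, kappa, gamma, delta, eps = params
    s, e, i, q, r = state
    path = [(s, e, i, q, r)]
    sqrt_dt = math.sqrt(dt)
    for _ in range(steps):
        drift = (
            lam - beta * s * i - mu * s,
            beta * s * i - (mu + kappa) * e,
            kappa * e - (mu + gamma + delta) * i,
            delta * i - (mu + eps) * q,
            gamma * i + eps * q - mu * r,
        )
        current = (s, e, i, q, r)
        # Drift step plus independent multiplicative noise per compartment
        s, e, i, q, r = (
            x + f * dt + sigma * x * sqrt_dt * rng.gauss(0.0, 1.0)
            for x, f in zip(current, drift)
        )
        path.append((s, e, i, q, r))
    return path


start = (0.99, 0.0, 0.01, 0.0, 0.0)
rates = (0.02, 0.5, 0.02, 0.2, 0.1, 0.1, 0.1)
deterministic = simulate_seiqr(start, rates, sigma=0.0, dt=0.1, steps=500)
noisy = simulate_seiqr(start, rates, sigma=0.05, dt=0.1, steps=500, seed=1)
print(deterministic[-1])
print(noisy[-1])
```

Setting `sigma=0.0` recovers the deterministic model, which gives a baseline against which the effect of random perturbations can be compared.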
Alzheimer's disease (AD) exhibits substantial clinical and biological heterogeneity, complicating efforts in treatment and intervention development. While new computational methods offer insights into AD progression, the reproducibility of these subtypes across datasets remains understudied, particularly concerning the robustness of subtype definitions when validated on diverse databases. This study evaluates the consistency of AD progression subtypes identified by the Subtype and Stage Inference (SuStaIn) algorithm using T1-weighted MRI data across 5,444 subjects from ANMerge, OASIS, and ADNI datasets, forming four independent cohorts. Each cohort was analyzed under two conditions: one using the full cohort, including cognitively normal controls, and another excluding controls to test subtype robustness. Results confirm the three primary atrophy subtypes identified in earlier studies: Typical, Cortical, and Subcortical, as well as the emergence of rare and atypical AD variants such as posterior cortical atrophy (PCA). Notably, each subtype displayed varying robustness to the inclusion of controls, with certain subtypes, like Subcortical, more influenced by cohort composition. This investigation underscores SuStaIn's reliability for defining stable AD subtypes and suggests its utility in clinical stratification for trials and diagnosis. However, our findings also highlight the need for improved dataset diversity, particularly in terms of ethnic representation, to enhance generalizability and support broader clinical application.
On broadly Copernican grounds, we are entitled to default assume that apparently behaviorally sophisticated extraterrestrial entities ("aliens") would be conscious. Otherwise, we humans would be inexplicably, implausibly lucky to have consciousness, while similarly behaviorally sophisticated entities elsewhere would be mere shells, devoid of consciousness. However, this Copernican default assumption is canceled in the case of behaviorally sophisticated entities designed to mimic superficial features associated with consciousness in humans ("consciousness mimics"), and in particular a broad class of current, near-future, and hypothetical robots. These considerations, which we formulate, respectively, as the Copernican and Mimicry Arguments, jointly defeat an otherwise potentially attractive parity principle, according to which we should apply the same types of behavioral or cognitive tests to aliens and robots, attributing or denying consciousness similarly to the extent they perform similarly. Instead of grounding speculations about alien and robot consciousness in metaphysical or scientific theories about the physical or functional bases of consciousness, our approach appeals directly to the epistemic principles of Copernican mediocrity and inference to the best explanation. This permits us to justify certain default assumptions about consciousness while remaining to a substantial extent neutral about specific metaphysical and scientific theories.
The application of language models (LMs) to molecular structure generation using line notations such as SMILES and SELFIES has been well-established in the field of cheminformatics. However, extending these models to generate 3D molecular structures presents significant challenges. Two primary obstacles emerge: (1) the difficulty in designing a 3D line notation that ensures SE(3)-invariant atomic coordinates, and (2) the non-trivial task of tokenizing continuous coordinates for use in LMs, which inherently require discrete inputs. To address these challenges, we propose Mol-StrucTok, a novel method for tokenizing 3D molecular structures. Our approach comprises two key innovations: (1) We design a line notation for 3D molecules by extracting local atomic coordinates in a spherical coordinate system. This notation builds upon existing 2D line notations and remains agnostic to their specific forms, ensuring compatibility with various molecular representation schemes. (2) We employ a Vector Quantized Variational Autoencoder (VQ-VAE) to tokenize these coordinates, treating them as generation descriptors. To further enhance the representation, we incorporate neighborhood bond lengths and bond angles as understanding descriptors. Leveraging this tokenization framework, we train a GPT-2 style model for 3D molecular generation tasks. Results demonstrate strong performance with significantly faster generation speeds and competitive chemical stability compared to previous methods. Further, by integrating our learned discrete representations into the Graphormer model for property prediction on the QM9 dataset, Mol-StrucTok shows consistent improvements across various molecular properties, underscoring the versatility and robustness of our approach.
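The idea of local spherical coordinates that do not change under rotation and translation can be sketched as follows. This is not the Mol-StrucTok construction: the choice of three reference atoms to build the local frame, and the function name `local_spherical`, are assumptions made for illustration. The reference atoms must not be collinear and the target must differ from the origin atom.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def unit(a):
    norm = math.sqrt(dot(a, a))
    return tuple(x / norm for x in a)


def local_spherical(origin, ref1, ref2, target):
    """Return (r, theta, phi) of `target` in a right-handed local frame.

    The frame is built from three already-placed atoms (an assumed scheme):
    x points from origin to ref1, z is normal to the origin/ref1/ref2 plane.
    """
    ex = unit(sub(ref1, origin))
    ez = unit(cross(ex, sub(ref2, origin)))
    ey = cross(ez, ex)
    d = sub(target, origin)
    x, y, z = dot(d, ex), dot(d, ey), dot(d, ez)
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(max(-1.0, min(1.0, z / r)))  # polar angle from local z
    phi = math.atan2(y, x)  # azimuth in the local xy-plane
    return r, theta, phi


atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
r, theta, phi = local_spherical(*atoms)
print(r, theta, phi)
```

Because the frame is defined by the molecule's own atoms, rigidly rotating or translating all atoms leaves `(r, theta, phi)` unchanged, which is the property a continuous-coordinate tokenizer can then build on.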
Recent trends in cell segmentation have shifted towards universal models to handle diverse cell morphologies and imaging modalities. However, for continuously emerging cell types and imaging techniques, these models still require hundreds or thousands of annotated cells for fine-tuning. We introduce CellSeg1, a practical solution for segmenting cells of arbitrary morphology and modality with a few dozen cell annotations in 1 image. By adopting Low-Rank Adaptation of the Segment Anything Model (SAM), we achieve robust cell segmentation. Tested on 19 diverse cell datasets, CellSeg1 trained on 1 image achieved 0.81 average mAP at 0.5 IoU, performing comparably to existing models trained on over 500 images. It also demonstrated superior generalization in cross-dataset tests on TissueNet. We found that high-quality annotation of a few dozen densely packed cells of varied sizes is key to effective segmentation. CellSeg1 provides an efficient solution for cell segmentation with minimal annotation effort.
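For readers unfamiliar with the reported metric, the sketch below shows one common way to score instance segmentation at a 0.5 IoU threshold, using the TP / (TP + FP + FN) convention often seen in cell segmentation benchmarks. The abstract does not state which exact definition the authors use, so this convention, the greedy matching, and the names `iou` and `average_precision` are assumptions.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of pixel coordinates."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def average_precision(pred_masks, true_masks, threshold=0.5):
    """TP / (TP + FP + FN) at one IoU threshold (an assumed convention)."""
    matched = set()
    true_positives = 0
    for pred in pred_masks:
        best_index, best_iou = None, 0.0
        for index, truth in enumerate(true_masks):
            if index in matched:
                continue
            value = iou(pred, truth)
            if value > best_iou:
                best_index, best_iou = index, value
        # Count a match only if the best unmatched ground-truth cell overlaps enough
        if best_index is not None and best_iou >= threshold:
            matched.add(best_index)
            true_positives += 1
    false_positives = len(pred_masks) - true_positives
    false_negatives = len(true_masks) - true_positives
    total = true_positives + false_positives + false_negatives
    # Returning 1.0 when there is nothing to detect is a convention chosen here
    return true_positives / total if total else 1.0


true_cells = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6)}]
pred_cells = [{(0, 0), (0, 1), (1, 0)}, {(5, 6), (5, 7), (5, 8)}]
ap = average_precision(pred_cells, true_cells)
print(ap)
```

In this toy case the first prediction matches its cell (IoU 0.75) while the second does not (IoU 0.25), giving one true positive, one false positive, and one false negative.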
Spatial Transcriptomics (ST) is a method that captures spatial gene expression profiles within histological sections. The discrete spatial distribution and the super-high dimensional sequencing results make ST data challenging to model effectively. In this paper, we model ST in a continuous and compact manner with the proposed tool, SUICA, empowered by the great approximation capability of Implicit Neural Representations (INRs) that can improve both the spatial resolution and the gene expression. Concretely within the proposed SUICA, we incorporate a graph-augmented Autoencoder to effectively model the context information of the unstructured spots and provide informative embeddings that are structure-aware for spatial mapping. We also tackle the extremely skewed distribution in a regression-by-classification fashion and enforce classification-based loss functions for the optimization of SUICA. In extensive experiments on a wide range of common ST platforms, SUICA outperforms both conventional INR variants and SOTA methods for ST super-resolution regarding numerical fidelity, statistical correlation, and bio-conservation. The prediction by SUICA also showcases amplified gene signatures that enrich the bio-conservation of the raw data and benefit subsequent analysis. The code is available at https://github.com/Szym29/SUICA.
Statistical physics provides tools for analyzing high-dimensional problems in machine learning and theoretical neuroscience. These calculations, particularly those using the replica method, often involve lengthy derivations that can obscure physical interpretation. We give concise, non-replica derivations of several key results and highlight their underlying similarities. Specifically, we introduce a cavity approach to analyzing high-dimensional learning problems and apply it to three cases: perceptron classification of points, perceptron classification of manifolds, and kernel ridge regression. These problems share a common structure -- a bipartite system of interacting feature and datum variables -- enabling a unified analysis. For perceptron-capacity problems, we identify a symmetry that allows derivation of correct capacities through a naïve method. These results match those obtained through the replica method.
Designing novel functional proteins crucially depends on accurately modeling their fitness landscape. Given the limited availability of functional annotations from wet-lab experiments, previous methods have primarily relied on self-supervised models trained on vast, unlabeled protein sequence or structure datasets. While initial protein representation learning studies solely focused on either sequence or structural features, recent hybrid architectures have sought to merge these modalities to harness their respective strengths. However, these sequence-structure models have so far achieved only incremental improvements when compared to the leading sequence-only approaches, highlighting unresolved challenges in effectively leveraging these modalities together. Moreover, the function of certain proteins is highly dependent on the granular aspects of their surface topology, which have been overlooked by prior models. To address these limitations, we introduce the Sequence-Structure-Surface Fitness (S3F) model - a novel multimodal representation learning framework that integrates protein features across several scales. Our approach combines sequence representations from a protein language model with Geometric Vector Perceptron networks encoding protein backbone and detailed surface topology. The proposed method achieves state-of-the-art fitness prediction on the ProteinGym benchmark encompassing 217 substitution deep mutational scanning assays, and provides insights into the determinants of protein function. Our code is at https://github.com/DeepGraphLearning/S3F.
Given a rooted tree $T$ on $n$ non-root leaves with colored and zeroed nodes, we construct a linear space $L_T$ of $n\times n$ symmetric matrices with constraints determined by the combinatorics of the tree. When $L_T$ represents the covariance matrices of a Gaussian model, it provides natural generalizations of Brownian motion tree (BMT) models in phylogenetics. When $L_T$ represents a space of concentration matrices of a Gaussian model, it gives certain colored Gaussian graphical models, which we refer to as BMT derived models. We investigate conditions under which the reciprocal variety $L_T^{-1}$ is toric. Relying on the birational isomorphism of the inverse matrix map, we show that if the BMT derived graph of $T$ is vertex-regular and a block graph, under the derived Laplacian transformation, $L_T^{-1}$ is the vanishing locus of a toric ideal. This ideal is given by the sum of the toric ideal of the Gaussian graphical model on the block graph, the toric ideal of the original BMT model, and binomial linear conditions coming from vertex-regularity. To this end, we provide monomial parametrizations for these toric models realized through paths among leaves in $T$.
Ionizable lipids are essential in developing lipid nanoparticles (LNPs) for effective messenger RNA (mRNA) delivery. While traditional methods for designing new ionizable lipids are typically time-consuming, deep generative models have emerged as a powerful solution, significantly accelerating the molecular discovery process. However, a practical challenge arises as the molecular structures generated can often be difficult or infeasible to synthesize. This project explores Monte Carlo tree search (MCTS)-based generative models for synthesizable ionizable lipids. Leveraging a synthetically accessible lipid building block dataset and two specialized predictors to guide the search through chemical space, we introduce a policy network guided MCTS generative model capable of producing new ionizable lipids with available synthesis pathways.
Ecological forecasts are model-based statements about currently unknown ecosystem states in time or space. For a model forecast to be useful to inform decision-makers, model validation and verification determine adequacy. The measure of forecast goodness that can be translated into a limit up to which a forecast is acceptable is known as the 'forecast horizon'. While verification of meteorological models follows strict criteria with established metrics and forecast horizons, assessments of ecological forecasting models still remain experiment-specific and forecast horizons are rarely reported. As such, users of ecological forecasts remain uninformed of how far into the future statements can be trusted. In this work, we synthesise existing approaches, define empirical forecast horizons in a unified framework for assessing ecological predictability and offer recipes on their computation. We distinguish upper and lower boundary estimates of predictability limits, reflecting the model's potential and actual forecast horizon, and show how a benchmark model can help determine its relative forecast horizon. The approaches are demonstrated with four case studies from population, ecosystem, and earth system research.
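One simple way to read the forecast-horizon idea is as the number of lead times for which a forecast error stays within an acceptable limit. The sketch below follows that reading; it is not the paper's recipe, and the function names, the error values, and the threshold are assumptions for illustration.

```python
def forecast_horizon(errors, threshold):
    """Count consecutive lead times, from the first, with error <= threshold."""
    horizon = 0
    for error in errors:
        if error > threshold:
            break
        horizon += 1
    return horizon


def relative_forecast_horizon(model_errors, benchmark_errors):
    """Count consecutive lead times at which the model beats the benchmark."""
    horizon = 0
    for model_error, benchmark_error in zip(model_errors, benchmark_errors):
        if model_error >= benchmark_error:
            break
        horizon += 1
    return horizon


# Hypothetical forecast errors at lead times 1..5 (e.g. in days)
model_errors = [0.10, 0.20, 0.35, 0.60, 0.50]
benchmark_errors = [0.30, 0.30, 0.30, 0.30, 0.30]  # e.g. a climatology baseline

absolute = forecast_horizon(model_errors, threshold=0.40)
relative = relative_forecast_horizon(model_errors, benchmark_errors)
print(absolute, relative)
```

Stopping at the first exceedance is a deliberate choice here: a later lead time that happens to fall back under the limit (0.50 after 0.60 in the toy data, had the threshold been higher) is not counted as trustworthy.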
Recent advancements in multimodal pre-training models have significantly advanced computational pathology. However, current approaches predominantly rely on visual-language models, which may impose limitations from a molecular perspective and lead to performance bottlenecks. Here, we introduce a Unified Molecule-enhanced Pathology Image REpresentation Learning framework (UMPIRE). UMPIRE aims to leverage complementary information from gene expression profiles to guide the multimodal pre-training, enhancing the molecular awareness of pathology image representation learning. We demonstrate that this molecular perspective provides a robust, task-agnostic training signal for learning pathology image embeddings. Due to the scarcity of paired data, approximately 4 million entries of spatial transcriptomics gene expression were collected to train the gene encoder. By leveraging powerful pre-trained encoders, UMPIRE aligns the encoders across over 697K pathology image-gene expression pairs. The performance of UMPIRE is demonstrated across various molecular-related downstream tasks, including gene expression prediction, spot classification, and mutation state prediction in whole slide images. Our findings highlight the effectiveness of multimodal data integration and open new avenues for exploring computational pathology enhanced by molecular perspectives. The code and pre-trained weights are available at https://github.com/Hanminghao/UMPIRE.
Single-molecule localization microscopy generates point clouds corresponding to fluorophore localizations. Spatial cluster identification and analysis of these point clouds are crucial for extracting insights about molecular organization. However, this task becomes challenging in the presence of localization noise, high point density, or complex biological structures. Here, we introduce MIRO (Multimodal Integration through Relational Optimization), an algorithm that uses recurrent graph neural networks to transform the point clouds in order to improve clustering efficiency when applying conventional clustering techniques. We show that MIRO supports simultaneous processing of clusters of different shapes and at multiple scales, demonstrating improved performance across varied datasets. Our comprehensive evaluation demonstrates MIRO's transformative potential for single-molecule localization applications, showcasing its capability to revolutionize cluster analysis and provide accurate, reliable details of molecular architecture. In addition, MIRO's robust clustering capabilities hold promise for applications in various fields such as neuroscience, for the analysis of neural connectivity patterns, and environmental science, for studying spatial distributions of ecological data.
The accurate prediction of B-cell epitopes is critical for guiding vaccine development against infectious diseases, including SARS and COVID-19. This study explores the use of a deep neural network (DNN) model to predict B-cell epitopes for SARS-CoV and SARS-CoV-2, leveraging a dataset that incorporates essential protein and peptide features. Traditional sequence-based methods often struggle with large, complex datasets, but deep learning offers promising improvements in predictive accuracy. Our model employs regularization techniques, such as dropout and early stopping, to enhance generalization, while also analyzing key features, including isoelectric point and aromaticity, that influence epitope recognition. Results indicate an overall accuracy of 82% in predicting COVID-19 negative and positive cases, with room for improvement in detecting positive samples. This research demonstrates the applicability of deep learning in epitope mapping, suggesting that such approaches can enhance the speed and precision of vaccine design for emerging pathogens. Future work could incorporate structural data and diverse viral strains to further refine prediction capabilities.
Transformers exhibit in-context learning (ICL): the ability to use novel information presented in the context without additional weight updates. Recent work shows that ICL emerges when models are trained on a sufficiently diverse set of tasks and the transition from memorization to generalization is sharp with increasing task diversity. One interpretation is that a network's limited capacity to memorize favors generalization. Here, we examine the mechanistic underpinnings of this transition using a small transformer applied to a synthetic ICL task. Using theory and experiment, we show that the sub-circuits that memorize and generalize can be viewed as largely independent. The relative rates at which these sub-circuits learn explain the transition from memorization to generalization, rather than capacity constraints. We uncover a memorization scaling law, which determines the task diversity threshold at which the network generalizes. The theory quantitatively explains a variety of other ICL-related phenomena, including the long-tailed distribution of when ICL is acquired, the bimodal behavior of solutions close to the task diversity threshold, the influence of contextual and data distributional statistics on ICL, and the transient nature of ICL.