2024-11-01 | | Total: 9
Proteins, essential to biological systems, perform functions intricately linked to their three-dimensional structures. Understanding the relationship between protein structures and their amino acid sequences remains a core challenge in protein modeling. While traditional protein foundation models benefit from pre-training on vast unlabeled datasets, they often struggle to capture critical co-evolutionary information, at which evolution-based methods excel. In this study, we introduce a novel pre-training strategy for protein foundation models that emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features from sequence data. Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability, outperforming established baselines of similar size, including the ESM model, across diverse downstream tasks. Experimental results confirm the model's effectiveness in integrating co-evolutionary information, marking a significant step forward in protein sequence-based modeling.
Visualization of multidimensional, categorical data is a common challenge across scientific areas and, in particular, the life sciences. The goal is to create a comprehensive overview of the underlying data that allows multiple variables to be assessed intuitively. One application where such visualizations are particularly useful is pathway analysis, where we check for dysregulation in known biological regulatory mechanisms and functions across multiple conditions. Here, we propose a new visualization approach that encodes such data in a comprehensive and intuitive representation: dice plots visualize up to four distinct categorical classes in a single view composed of multiple elements resembling the faces of dice, whereas domino plots add a further layer of information for binary comparison. The code is available as the diceplot R package, as pydiceplot on pip and at https://github.com/maflot.
Phenology, the timing of cyclical plant life events such as leaf emergence and coloration, is crucial in the bio-climatic system. Climate change drives shifts in these phenological events, impacting ecosystems and the climate itself. Accurate phenology models are essential to predict the occurrence of these phases under changing climatic conditions. Existing methods include hypothesis-driven process models and data-driven statistical approaches. Process models account for dormancy stages and various phenology drivers, while statistical models typically rely on linear or traditional machine learning techniques. Research shows that process models often outperform statistical methods when predicting under climate conditions outside historical ranges, especially in climate change scenarios. However, deep learning approaches remain underexplored in climate phenology modeling. We introduce PhenoFormer, a neural architecture that is better suited than traditional statistical methods to predicting phenology under a shift in the climate data distribution, while significantly improving on or performing on par with the best-performing process-based models. Our numerical experiments on a 70-year dataset of 70,000 phenological observations from 9 woody species in Switzerland show that PhenoFormer outperforms traditional machine learning methods by an average of 13% R² and 1.1 days RMSE for spring phenology, and 11% R² and 0.7 days RMSE for autumn phenology, while matching or exceeding the best process-based models. Our results demonstrate that deep learning has the potential to be a valuable methodological tool for accurate climate-phenology prediction, and that PhenoFormer is a promising first step toward improving phenological predictions before a complete understanding of the underlying physiological mechanisms is available.
The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym, the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, thereby standardizing the MS/MS annotation tasks and rendering the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.
Constructing atomic models from cryo-electron microscopy (cryo-EM) maps is a crucial yet intricate task in structural biology. While advancements in deep learning, such as convolutional neural networks (CNNs) and graph neural networks (GNNs), have spurred the development of sophisticated map-to-model tools like DeepTracer and ModelAngelo, their efficacy notably diminishes with low-resolution maps beyond 4 Å. To address this shortfall, our research introduces DeepTracer-LowResEnhance, a framework that combines a deep learning-enhanced map refinement technique with AlphaFold. This methodology is designed to markedly improve the construction of models from low-resolution cryo-EM maps. DeepTracer-LowResEnhance was tested on a set of 37 protein cryo-EM maps with resolutions ranging from 2.5 to 8.4 Å, including 22 maps with resolutions worse than 4 Å. For 95.5% of the low-resolution maps, the total number of predicted residues increased significantly, which indicates a pronounced improvement in atomic model building for low-resolution maps. Additionally, a comparison with Phenix's auto-sharpening functionality shows that DeepTracer-LowResEnhance produces more detailed and precise atomic models, thereby pushing the boundaries of current computational structural biology methodologies.
The accurate prediction of geometric state evolution in complex systems is critical for advancing scientific domains such as quantum chemistry and material modeling. Traditional experimental and computational methods face challenges in terms of environmental constraints and computational demands, while current deep learning approaches still fall short in terms of precision and generality. In this work, we introduce the Geometric Diffusion Bridge (GDB), a novel generative modeling framework that accurately bridges initial and target geometric states. GDB leverages a probabilistic approach to evolve geometric state distributions, employing an equivariant diffusion bridge derived by a modified version of Doob's $h$-transform for connecting geometric states. This tailored diffusion process is anchored by initial and target geometric states as fixed endpoints and governed by equivariant transition kernels. Moreover, trajectory data can be seamlessly leveraged in our GDB framework by using a chain of equivariant diffusion bridges, providing a more detailed and accurate characterization of evolution dynamics. Theoretically, we conduct a thorough examination to confirm our framework's ability to preserve joint distributions of geometric states and capability to completely model the underlying dynamics inducing trajectory distributions with negligible error. Experimental evaluations across various real-world scenarios show that GDB surpasses existing state-of-the-art approaches, opening up a new pathway for accurately bridging geometric states and tackling crucial scientific challenges with improved accuracy and applicability.
The vascular network of leaves, comprising xylem and phloem, is a highly optimized system for the delivery of water, nutrients, and sugars. The design rules for these naturally occurring networks have been studied since the time of Leonardo da Vinci, who constructed a local rule for comparing the widths of in- and outgoing veins at branch points. Recently, physical models have been developed that seek to explain the full morphogenesis of leaf venation networks in which veins grow in response to local hydrodynamic feedback. Although these models go beyond simple local rules, they are challenging to compare to experimental data. Here, we extend these hydrodynamic models to a state where the direct comparison with images of full leaves becomes possible on the level of individual veins. We present a dataset of leaf venation networks that maintain full network topology and use this to discuss the benefits and drawbacks of such a direct comparison. We apply our approach to the direct estimation of a sink fluctuation parameter, demonstrating consistency within distinct leaf species. Finally, we utilize the ability of the model to run on full leaves to define and calculate exponents for a Murray's law that applies to reticulate venation networks.
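To make the branch-point rules mentioned in the abstract above concrete, the following is a minimal sketch of the classical (tree-like) form of Murray's law, in which the parent radius r_0 and the daughter radii r_i satisfy r_0^a = sum_i r_i^a. The exponent a = 3 is Murray's original value, and a = 2 corresponds to the area-preserving rule usually attributed to da Vinci. The function name `murray_exponent` and its parameters are hypothetical, and this is not the reticulate-network definition the authors develop; it only illustrates how an exponent can be estimated from measured widths at a single junction.

```python
def murray_exponent(parent, children, lo=1e-6, hi=50.0, tol=1e-10):
    """Estimate a such that parent**a == sum(c**a for c in children).

    Hypothetical helper for a single tree-like branch point; assumes
    every daughter radius is strictly smaller than the parent radius.
    """
    if len(children) < 2 or any(c <= 0 or c >= parent for c in children):
        raise ValueError("need >= 2 daughters, each with 0 < radius < parent")

    def f(a):
        # Decreasing in a because every ratio c / parent is below 1.
        return sum((c / parent) ** a for c in children) - 1.0

    if f(hi) > 0:
        raise ValueError("no exponent found below the upper bound")

    # Plain bisection: f(lo) > 0 and f(hi) <= 0 bracket the root.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Symmetric bifurcation obeying Murray's cubic law.
cubic = murray_exponent(2 ** (1 / 3), [1.0, 1.0])
# Symmetric bifurcation obeying the area-preserving rule.
area = murray_exponent(2 ** 0.5, [1.0, 1.0])
```

In practice the radii would come from segmented leaf images, and one exponent per junction (or a single fitted exponent across junctions) could then be compared between species.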
Enormous progress has been made in the last 20 years since the publication of our review \cite{csk05polrev} in this journal on transport and traffic phenomena in biology. In this brief article we present a glimpse of the major advances during this period. First, we present similarities and differences between collective intracellular transport of a single micron-size cargo by multiple molecular motors and that of a cargo particle by a team of ants on the basis of the common principle of load-sharing. Second, we sketch several models, all of which are biologically motivated extensions of the Asymmetric Simple Exclusion Process (ASEP); some of these models represent the traffic of molecular machines, like RNA polymerase (RNAP) and ribosome, that catalyze template-directed polymerization of RNA and proteins, respectively, whereas a few other models capture the key features of the traffic of ants on trails. More specifically, using the ASEP-based models we demonstrate the effects of traffic of RNAPs and ribosomes on random and 'programmed' errors in gene expression as well as on some other subcellular processes. We recall a puzzling empirical result on the single-lane traffic of predatory ants Leptogenys processionalis as well as recent attempts to account for this puzzle. We also mention some surprising effects of lane-changing rules observed in an ASEP-based model for 3-lane traffic of army ants. Finally, we explain the conceptual similarities between the pheromone-mediated indirect communication, called stigmergy, between ants on a trail and the floor-field-mediated interaction between humans in pedestrian traffic. For the floor-field model of human pedestrian traffic we present a major theoretical result that is relevant from the perspective of all types of traffic phenomena.
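For readers unfamiliar with the ASEP referred to in the abstract above, here is a minimal sketch of the basic process on a periodic lattice with random-sequential updates: each site holds at most one particle, and a chosen particle attempts to hop right with probability p or left with probability q, succeeding only if the target site is empty. The function names, the ring geometry, and the parameter values are illustrative assumptions; the biologically motivated extensions discussed in the review (for RNAP, ribosomes, or ants) add further rules that are not modeled here.

```python
import random


def asep_step(lattice, p, q, rng):
    """One random-sequential update attempt on a ring (in place)."""
    n = len(lattice)
    i = rng.randrange(n)
    if lattice[i] == 0:
        return  # no particle at the chosen site
    r = rng.random()
    if r < p:
        j = (i + 1) % n  # attempt a hop to the right
    elif r < p + q:
        j = (i - 1) % n  # attempt a hop to the left
    else:
        return  # particle stays put this time
    if lattice[j] == 0:  # exclusion: hop only into an empty site
        lattice[i], lattice[j] = 0, 1


def simulate_asep(n_sites, n_particles, p, q, sweeps, seed=0):
    """Run the ring ASEP and return the final occupation list."""
    if not 0.0 <= p + q <= 1.0:
        raise ValueError("p + q must lie in [0, 1]")
    rng = random.Random(seed)
    lattice = [1] * n_particles + [0] * (n_sites - n_particles)
    rng.shuffle(lattice)
    for _ in range(sweeps * n_sites):
        asep_step(lattice, p, q, rng)
    return lattice


final = simulate_asep(n_sites=50, n_particles=20, p=0.8, q=0.1, sweeps=100)
```

Setting q = 0 gives the totally asymmetric variant (TASEP); the exclusion rule is what produces traffic-like jamming at high densities.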
Data increasingly take the form of a multi-way array, or tensor, in several biomedical domains. Such tensors are often incompletely observed. For example, we are motivated by longitudinal microbiome studies in which several timepoints are missing for several subjects. There is a growing literature on missing data imputation for tensors. However, existing methods give a point estimate for missing values without capturing uncertainty. We propose a multiple imputation approach for tensors in a flexible Bayesian framework that yields realistic simulated values for missing entries and can propagate uncertainty through subsequent analyses. Our model uses efficient and widely applicable conjugate priors for a CANDECOMP/PARAFAC (CP) factorization, with a separable residual covariance structure. This approach is shown to perform well with respect to both imputation accuracy and uncertainty calibration, for scenarios in which either single entries or entire fibers of the tensor are missing. For two microbiome applications, it is shown to accurately capture uncertainty in the full microbiome profile at missing timepoints and is used to infer trends in species diversity for the population. Documented R code to perform our multiple imputation approach is available at https://github.com/lockEF/MultiwayImputation.
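As background for the CP factorization named in the abstract above, the sketch below shows how a three-way tensor is rebuilt from rank-R factor matrices, X[i][j][k] = sum_r A[i][r] * B[j][r] * C[k][r], and how missing entries could be filled from such a reconstruction. It is written in Python rather than the authors' R, the function names are hypothetical, and it produces a single point imputation: it does not implement the Bayesian multiple imputation, the conjugate priors, or the separable residual covariance described in the abstract.

```python
def cp_reconstruct(A, B, C):
    """Rebuild a 3-way tensor from CP factor matrices.

    A, B, C are lists of rows with shapes (I x R), (J x R), (K x R).
    """
    rank = len(A[0])
    return [
        [
            [sum(A[i][r] * B[j][r] * C[k][r] for r in range(rank))
             for k in range(len(C))]
            for j in range(len(B))
        ]
        for i in range(len(A))
    ]


def fill_missing(X, X_hat):
    """Return a copy of X with None entries replaced by X_hat values."""
    return [
        [
            [X_hat[i][j][k] if x is None else x
             for k, x in enumerate(row)]
            for j, row in enumerate(slab)
        ]
        for i, slab in enumerate(X)
    ]


# Rank-1 toy example: 2 subjects x 2 features x 2 timepoints.
A = [[1.0], [2.0]]
B = [[3.0], [4.0]]
C = [[5.0], [6.0]]
X_hat = cp_reconstruct(A, B, C)

# Observed tensor with one missing entry, marked as None.
X_obs = [[[15.0, 18.0], [20.0, 24.0]], [[30.0, None], [40.0, 48.0]]]
X_filled = fill_missing(X_obs, X_hat)
```

A multiple imputation scheme would instead draw many sets of factors (and residuals) from a posterior and produce one completed tensor per draw, so that downstream analyses can reflect the spread across draws.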