2025-03-28 | Total: 20
The RNA structure-function relationship has recently garnered significant attention within the deep learning community, promising to grow in importance as nucleic acid structure models advance. However, the absence of standardized and accessible benchmarks for deep learning on RNA 3D structures has impeded the development of models for RNA functional characteristics. In this work, we introduce a set of seven benchmarking datasets for RNA structure-function prediction, designed to address this gap. Our library builds on the established Python library rnaglib, and offers easy data distribution and encoding, splitters and evaluation methods, providing a convenient all-in-one framework for comparing models. Datasets are implemented in a fully modular and reproducible manner, facilitating community contributions and customization. Finally, we provide initial baseline results for all tasks using a graph neural network. Source code: https://github.com/cgoliver/rnaglib Documentation: https://rnaglib.org
Bacteria frequently colonize natural microcavities such as gut crypts, plant apoplasts, and soil pores. Recent studies have shown that the physical structure of these spaces plays a crucial role in shaping the stability and resilience of microbial populations (Karita et al., PNAS 2022; Postek et al., PNAS 2024). Here, we demonstrate that protected microhabitats can emerge dynamically, even in the absence of physical barriers. Interactions with surface features -- such as roughness or friction -- lead microbial populations to self-organize into effectively segregated subpopulations. Our numerical and analytical models reveal that this self-organization persists even when strains have different growth rates, allowing slower-growing strains to avoid competitive exclusion. These findings suggest that emergent spatial structuring can serve as a fundamental mechanism for maintaining microbial diversity, despite selection pressures, competition, and genetic drift.
This paper presents an analysis of the orientation selectivity properties of idealized models of complex cells in terms of affine quasi quadrature measures, which combine the responses of idealized models of simple cells in terms of affine Gaussian derivatives by (i) pointwise squaring, (ii) summation of responses for different orders of spatial derivation and (iii) spatial integration. Specifically, this paper explores the consequences of assuming that the family of spatial receptive fields should be covariant under spatial affine transformations, thereby implying that the receptive fields ought to span a variability over the degree of elongation. We investigate the theoretical properties of three main ways of defining idealized models of complex cells and compare the predictions from these models to neurophysiologically obtained receptive field histograms over the resultant of biological orientation selectivity curves. It is shown that the extended modelling mechanisms lead to more uniform behaviour and a wider span over the values of the resultant that are covered, compared to earlier presented idealized models of complex cells without spatial integration. More generally, we propose that the presented methodology could be used as a new tool to evaluate other computational models of complex cells in relation to biological measurements.
The present study investigates the compatibility of mycoinsecticides based on isolates IBCB66 and Simbi BB15 of Beauveria bassiana and Esalq-1296 of Cordyceps javanica, which are registered for the management of Dalbulus maidis in Brazil, with synthetic fungicides. Irrespective of the fungicide, a total inhibition in the number of colony-forming units (CFUs), vegetative growth, conidiogenesis, and conidial viability of the three tested isolates was observed, with their incompatibility being indicated in the in vitro bioassays. However, the use of formulated mycoinsecticides mitigated the impact of these xenobiotics on the number of CFUs, with the commercial mycoinsecticide FlyControl (B. bassiana isolate Simbi BB15) being the least sensitive to the fungicides propiconazole + difenoconazole, bixafen + prothioconazole + trifloxystrobin and trifloxystrobin + tebuconazole. Nevertheless, an increase in exposure time (from 1.5 to 3 hours) generally led to an increase in the toxicity of fungicides towards entomopathogens. Physical-chemical compatibility assessments indicated that physical incompatibilities were observed, depending on the mycoinsecticide formulation. In addition, in vivo bioassays employing D. maidis adults demonstrated that, despite a synergistic effect on mortality in certain binary mixtures, no cadavers exposed to such mixtures exhibited fungal extrusion. Furthermore, analyses using UHPLC/MS/MS revealed alterations in the degradation kinetics (k) of the active ingredient (a.i.) pyraclostrobin, with changes greater than tenfold being observed in the different formulations of the fungicides that were tested. Consequently, given the diminished degradation kinetics of the active ingredients in maize plants, the implementation of mycoinsecticides should precede, in isolation, the application of synthetic fungicides within the framework of phytosanitary management of maize crops.
Explaining the wide range of dynamics observed in ecological communities is challenging due to the large number of species involved, the complex network of interactions among them, and the influence of multiple environmental variables. Here, we consider a general framework to model the dynamics of species-rich communities under the effects of external environmental factors, showing that it naturally leads to delayed interactions between species, and analyze the impact of such memory effects on population dynamics. Employing the generalized Lotka-Volterra equations with time delays and random interactions, we characterize the resulting dynamical phases in terms of the statistical properties of community interactions. Our findings reveal that memory effects can generate persistent and synchronized oscillations in species abundances in sufficiently competitive communities. This provides an additional explanation for synchronization in large communities, complementing known mechanisms such as predator-prey cycles and environmental periodic variability. Furthermore, we show that when reciprocal interactions are negatively correlated, time delays alone can induce chaotic behavior. This suggests that ecological complexity is not a prerequisite for unpredictable population dynamics, as intrinsic memory effects are sufficient to generate long-term fluctuations in species abundances. The techniques developed in this work are applicable to any high-dimensional random dynamical system with time delays.
Motivation: Bulk RNA-Seq is a widely used method for studying gene expression across a variety of contexts. The significance of RNA-Seq studies has grown with the advent of high-throughput sequencing technologies. Computational methods have been developed for each stage of the identification of differentially expressed genes. Nevertheless, there are few studies exploring the association between different types of methods. In this study, we evaluated the impact of combining methodologies on the results of differential expression analysis. Using two data sets with qPCR data (as a gold-standard reference), seven methods implemented in R packages (EBSeq, edgeR, DESeq2, limma, SAMseq, NOISeq, and Knowseq) were run and assessed both separately and in association. The results were evaluated against the adopted qPCR data. Results: Here, we introduce consexpressionR, an R package that automates differential expression analysis using a consensus of at least seven methodologies, producing more assertive results with a significant reduction in false positives. Availability: consexpressionR is an R package; source code and support are available at GitHub (https://github.com/costasilvati/consexpressionR).
This study investigated age-related changes in functional connectivity using resting-state fMRI and explored the efficacy of traditional deep learning for classifying brain developmental stages (BDS). Functional connectivity was assessed using Seed-Based Phase Synchronization (SBPS) and Pearson correlation across 160 ROIs. Clustering was performed using t-SNE, and network topology was analyzed through graph-theoretic metrics. Adaptive learning was implemented to classify the age group by extracting bottleneck features through MobileNetV2. These deep features were embedded and classified using Random Forest and PCA. Results showed a shift in phase synchronization patterns from sensory-driven networks in youth to more distributed networks with aging. t-SNE revealed that SBPS provided the most distinct clustering of BDS. Global efficiency and participation coefficient followed an inverted U-shaped trajectory, while clustering coefficient and modularity exhibited a U-shaped pattern. MobileNet outperformed other models, achieving the highest classification accuracy for BDS. Aging was associated with reduced global integration and increased local connectivity, indicating functional network reorganization. While this study focused solely on functional connectivity from resting-state fMRI and a limited set of connectivity features, deep learning demonstrated superior classification performance, highlighting its potential for characterizing age-related brain changes.
Ermiao San (EMS), a traditional Chinese medicine composed of Atractylodes macrocephala and Cortex Phellodendron, has demonstrated therapeutic efficacy in rheumatoid arthritis (RA). Studies suggest that EMS modulates dendritic cell (DC) maturation in adjuvant arthritis (AA) rats, though the precise mechanisms remain unclear. Prostaglandin receptor 4 (EP4) is critical in inflammation and DC function, while cyclic adenosine monophosphate (cAMP) regulates cellular signaling, potentially influencing RA pathogenesis via protein kinase A (PKA) and cAMP response element-binding protein (CREB) activation. EMS exerts protective effects in RA rats by suppressing DC functions, including reduced EP4 mRNA/protein expression, diminished cAMP levels, and impaired CREB phosphorylation. Additionally, serum from EMS-treated rats inhibited antigen uptake by bone marrow-derived DCs (BMDCs), downregulating CD40, CD80, and CD86 expression and altering pro-inflammatory cytokine secretion. Mechanistically, EMS-treated serum suppressed the EP4-cAMP pathway by decreasing EP4 protein expression and CREB activation, alongside reduced intracellular cAMP and PKA levels in BMDCs co-stimulated with PGE2 and TNF-α. These findings indicate that EMS alleviates RA by inhibiting the EP4-cAMP-CREB signaling axis in DCs, providing a scientific rationale for its clinical application in RA treatment.
In this paper, we present a simple method to integrate risk-contact data, obtained via digital contact monitoring (DCM) apps, in conventional compartmental transmission models. During the recent COVID-19 pandemic, many such data have been collected for the first time via newly developed DCM apps. However, it is unclear what the added value of these data is, unlike that of traditionally collected data via, e.g., surveys during non-epidemic times. The core idea behind our method is to express the number of infectious individuals as a function of the proportion of contacts that were with infected individuals and use this number as a starting point to initialize the remaining compartments of the model. As an important consequence, using our method, we can estimate key indicators such as the effective reproduction number using only two types of daily aggregated contact information, namely the average number of contacts and the average number of those contacts that were with an infected individual. We apply our method to the recent COVID-19 epidemic in the Netherlands, using self-reported data from the health surveillance app COVID RADAR and proximity-based data from the contact tracing app CoronaMelder. For both data sources, our corresponding estimates of the effective reproduction number agree both in time and magnitude with estimates based on other more detailed data sources such as daily numbers of cases and hospitalizations. This suggests that the use of DCM data in transmission models, regardless of the precise data type and for example via our method, offers a promising alternative for estimating the state of an epidemic, especially when more detailed data are not available.
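The core idea in the abstract above, turning two daily contact averages into a number of infectious individuals, can be illustrated with a very rough sketch. This is not the authors' model: it assumes homogeneous mixing, so that the fraction of contacts that were with an infected individual approximates the infectious fraction of the population. The function names and the population size are hypothetical.

```python
def estimate_infectious(population, mean_contacts, mean_risk_contacts):
    """Rough estimate of the number of infectious individuals.

    Assumption (not from the abstract): under homogeneous mixing, the share
    of contacts that were with an infected individual approximates the
    infectious fraction of the population.
    """
    if mean_contacts <= 0:
        raise ValueError("mean_contacts must be positive")
    if not 0 <= mean_risk_contacts <= mean_contacts:
        raise ValueError("mean_risk_contacts must lie in [0, mean_contacts]")
    risk_fraction = mean_risk_contacts / mean_contacts
    return population * risk_fraction


def estimate_infectious_series(population, daily_contacts, daily_risk_contacts):
    """Apply the estimate day by day to two aligned lists of daily averages."""
    if len(daily_contacts) != len(daily_risk_contacts):
        raise ValueError("daily series must have the same length")
    return [
        estimate_infectious(population, contacts, risk)
        for contacts, risk in zip(daily_contacts, daily_risk_contacts)
    ]
```

In a compartmental model, a series like this would only serve as a starting point for initializing the remaining compartments, and the step from it to an effective reproduction number needs further modelling choices that are not sketched here.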
Deep learning models have become fundamental tools in drug design. In particular, large language models trained on biochemical sequences learn feature vectors that guide drug discovery through virtual screening. However, such models do not capture the molecular interactions important for binding affinity and specificity. Therefore, there is a need to 'compose' representations from distinct biological modalities to effectively represent molecular complexes. We present an overview of the methods to combine molecular representations and propose that future work should balance computational efficiency and expressiveness. Specifically, we argue that improvements in both speed and accuracy are possible by learning to merge the representations from internal layers of domain-specific biological language models. We demonstrate that 'composing' biochemical language models performs similarly to or better than standard methods representing molecular interactions despite having significantly fewer features. Finally, we discuss recent methods for interpreting and democratizing large language models that could aid the development of interaction-aware foundation models for biology, as well as their shortcomings.
Prior results for tRNA and 5S rRNA demonstrated that secondary structure prediction accuracy can be significantly improved by modifying the parameters in the multibranch loop entropic penalty function. However, for reasons not well understood at the time, the scale of improvement possible across both families was well below the level for each family when considered separately. We resolve this dichotomy here by showing that each family has a characteristic target region geometry, which is distinct from the other and significantly different from their own dinucleotide shuffles. This required a much more efficient approach to computing the necessary information from the branching parameter space, and a new theoretical characterization of the region geometries. The insights gained point strongly to considering multiple possible secondary structures generated by varying the multiloop parameters. We provide proof-of-principle results that this significantly improves prediction accuracy across all 8 additional families in the Archive II benchmarking dataset.
Mathematical and computational modelling in oncology has played an increasingly important role in not only understanding the impact of various approaches to treatment on tumour growth, but in optimizing dosing regimens and aiding the development of treatment strategies. However, as with all modelling, only an approximation is made in the description of the biological and physical system. Here we show that tissue-scale spatial structure can have a profound impact on the resilience of tumours to immunotherapy using a classical model incorporating IL-2 compounds and effector cells as treatment parameters. Using linear stability analysis, numerical continuation, and direct simulations, we show that diffusing cancer cell populations can undergo pattern-forming (Turing) instabilities, leading to spatially-structured states that persist far into treatment regimes where the corresponding spatially homogeneous systems would uniformly predict a cancer-free state. These spatially-patterned states persist in a wide range of parameters, as well as under time-dependent treatment regimes. Incorporating treatment via domain boundaries can increase this resistance to treatment in the interior of the domain, further highlighting the importance of spatial modelling when designing treatment protocols informed by mathematical models. Counter-intuitively, this mechanism shows that increased effector cell mobility can increase the resilience of tumours to treatment. We conclude by discussing practical and theoretical considerations for understanding this kind of spatial resilience in other models of cancer treatment, in particular those incorporating more realistic spatial transport.
Neurons encode information in a binary manner and process complex signals. However, predicting or generating diverse neural activity patterns remains challenging. In vitro and in vivo studies provide distinct advantages, yet no robust computational framework seamlessly integrates both data types. We address this by applying the Transformer model, widely used in large-scale language models, to neural data. To handle binary data, we introduced Dice loss, enabling accurate cross-domain neural activity generation. Structural analysis revealed how Dice loss enhances learning and identified key brain regions facilitating high-precision data generation. Our findings support the 3Rs principle in animal research, particularly Replacement, and establish a mathematical framework bridging animal experiments and human clinical studies. This work advances data-driven neuroscience and neural activity modeling, paving the way for more ethical and effective experimental methodologies.
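For readers unfamiliar with the Dice loss mentioned above, the following is a minimal sketch of the standard soft Dice loss between predicted probabilities and binary targets. It is a generic formulation, not the authors' implementation; the function name and the smoothing constant `eps` are assumptions made for illustration.

```python
def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss: 1 - 2*|P∩T| / (|P| + |T|).

    pred   -- predicted probabilities in [0, 1]
    target -- binary ground truth (0 or 1), same length as pred
    eps    -- small smoothing constant (an assumption here) that avoids
              division by zero when both inputs are all zeros
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)
```

Because the loss depends on the overlap between predicted and true active entries rather than on every entry equally, it is less dominated by the many zeros in sparse binary data than a per-element loss would be, which is the usual motivation for choosing it.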
The differentiation between pathological subtypes of non-small cell lung cancer (NSCLC) is an essential step in guiding treatment options and prognosis. However, current clinical practice relies on multi-step staining and labelling processes that are time-intensive and costly, requiring highly specialised expertise. In this study, we propose a label-free methodology that facilitates autofluorescence imaging of unstained NSCLC samples and deep learning (DL) techniques to distinguish between non-cancerous tissue, adenocarcinoma (AC), squamous cell carcinoma (SqCC), and other subtypes (OS). We conducted DL-based classification and generated virtual immunohistochemical (IHC) stains, including thyroid transcription factor-1 (TTF-1) for AC and p40 for SqCC, and evaluated these methods using two types of autofluorescence imaging: intensity imaging and lifetime imaging. The results demonstrate the exceptional ability of this approach for NSCLC subtype differentiation, achieving an area under the curve above 0.981 and 0.996 for binary- and multi-class classification, respectively. Furthermore, this approach produces clinical-grade virtual IHC staining which was blind-evaluated by three experienced thoracic pathologists. Our label-free NSCLC subtyping approach enables rapid and accurate diagnosis without conventional tissue processing and staining. Both strategies can significantly accelerate diagnostic workflows and support efficient lung cancer diagnosis, without compromising clinical decision-making.
AI-assisted protein design has emerged as a critical tool for advancing biotechnology, as deep generative models have demonstrated their reliability in this domain. However, most existing models primarily utilize protein sequence or structural data for training, neglecting the physicochemical properties of proteins. Moreover, they are poorly suited to controlling protein generation under intuitive conditions. To address these limitations, we propose CMADiff, a novel framework that enables controllable protein generation by aligning the physicochemical properties of protein sequences with text-based descriptions through a latent diffusion process. Specifically, CMADiff employs a Conditional Variational Autoencoder (CVAE) to integrate physicochemical features as conditional input, forming a robust latent space that captures biological traits. In this latent space, we apply a conditional diffusion process, which is guided by BioAligner, a contrastive learning-based module that aligns text descriptions with protein features, enabling text-driven control over protein sequence generation. Validated by a series of evaluations including AlphaFold3, the experimental results indicate that CMADiff outperforms protein sequence generation benchmarks and holds strong potential for future applications. The implementation and code are available at https://github.com/HPC-NEAU/PhysChemDiff.
Population genetic processes, such as the adaptation of a quantitative trait to directional selection, may occur on longer time scales than the sweep of a single advantageous mutation. To study such processes in finite populations, approximations for the time course of the distribution of a beneficial mutation were derived previously by branching process methods. The application to the evolution of a quantitative trait requires bounds for the probability of survival S_n up to generation n of a single beneficial mutation. Here, we present a method to obtain a simple, analytically explicit, either upper or lower, bound for S_n in a supercritical Galton-Watson process. We prove the existence of an upper bound for offspring distributions including Poisson and binomial. They are constructed by bounding the given generating function, φ, by a fractional linear one that has the same survival probability S_∞ and yields the same rate of convergence of S_n to S_∞ as φ. For distributions with at most three offspring, we characterize when this method yields an upper bound, a lower bound, or only an approximation. Because for many distributions it is difficult to get a handle on S_∞, we derive an approximation by series expansion in s, where s is the selective advantage of the mutant. We briefly review well-known asymptotic results that generalize Haldane's approximation 2s for S_∞, as well as less well-known results on sharp bounds for S_∞. We apply them to explore when bounds for S_n exist for a family of generalized Poisson distributions. Numerical results demonstrate the accuracy of our and of previously derived bounds for S_∞ and S_n. Finally, we treat an application of these results to determine the response of a quantitative trait to prolonged directional selection.
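The survival probability up to generation n discussed above (written S_n here) can be computed numerically by the standard iteration of the offspring generating function φ: with q_0 = 0 and q_n = φ(q_(n-1)), the extinction probability by generation n is q_n and S_n = 1 - q_n. The sketch below uses this textbook iteration with a Poisson offspring distribution of mean 1 + s; it does not implement the paper's fractional linear bounds, and the function names are chosen for illustration.

```python
import math


def poisson_pgf(x, mean):
    """Probability generating function of a Poisson offspring distribution."""
    return math.exp(mean * (x - 1.0))


def survival_probabilities(pgf, generations):
    """Return [S_1, ..., S_n] for a Galton-Watson process started by one individual.

    Uses q_0 = 0, q_n = pgf(q_{n-1}) and S_n = 1 - q_n, where q_n is the
    probability that the lineage is extinct by generation n.
    """
    q = 0.0
    survival = []
    for _ in range(generations):
        q = pgf(q)
        survival.append(1.0 - q)
    return survival


s = 0.05  # selective advantage of the mutant (illustrative value)
S = survival_probabilities(lambda x: poisson_pgf(x, 1.0 + s), 2000)
haldane = 2.0 * s  # Haldane's approximation for the limit of S_n
```

For this small s, the last entry of `S` is close to, and slightly below, Haldane's value 2s, and the sequence decreases monotonically towards it, which is the convergence that the bounds in the abstract are designed to capture in closed form.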
In the past few decades, the life sciences have experienced an unprecedented accumulation of data, ranging from genomic sequences and proteomic profiles to high-content imaging, clinical assays, and commercial biological products for research. Traditional static databases have been invaluable in providing standardized and structured information. However, they fall short when it comes to facilitating exploratory data interrogation, real-time query, multidimensional comparison and dynamic visualization. Interactive databases that support user-driven data queries and visualization offer promising new avenues for making the best use of the vast and heterogeneous data streams collected in biological research. This article discusses the potential of interactive databases, highlighting the importance of implementing this model in the life sciences, while going through the state-of-the-art in database design, technical choices behind modern data management systems, and emerging needs in multidisciplinary research. Special attention is given to data interrogation strategies, user interface design, and comparative analysis capabilities, along with challenges such as data standardization and scalability in data-heavy applications. Conceptual features for developing interactive databases across diverse life science domains are then presented in the use case of cell line selection for in vitro research to bridge the gap between research data generation, actionable biological insight, subsequent meaningful experimental design, and clinical relevance.
The genome contains genetic information essential for a cell's life. The genome's spatial organization inside the cell nucleus is critical for its proper function, including gene regulation. The two major genomic compartments -- euchromatin and heterochromatin -- contain largely transcriptionally active and silenced genes, respectively, and exhibit distinct dynamics. In this work, we present a hydrodynamic framework that describes the large-scale behavior of euchromatin and heterochromatin, and accounts for the interplay of mechanical forces, active processes, and nuclear confinement. Our model shows that contractile stresses from cross-linking proteins lead to the formation of heterochromatin droplets via mechanically driven phase separation. These droplets grow, coalesce, and in nuclear confinement, wet the boundary. Active processes, such as gene transcription in euchromatin, introduce non-equilibrium fluctuations that drive long-range, coherent motions of chromatin as well as the nucleoplasm, and thus alter the genome's spatial organization. These fluctuations also indirectly deform heterochromatin droplets, by continuously changing their shape. Taken together, our findings reveal how active forces, mechanical stresses and hydrodynamic flows contribute to the genome's organization at large scales and provide a physical framework for understanding chromatin organization and dynamics in live cells.
We study the equilibrium phases of a generalized Lotka-Volterra model characterized by a species interaction matrix which is random, sparse and symmetric. Dynamical fluctuations are modeled by a demographic noise with amplitude proportional to the effective temperature T. The equilibrium distribution of species abundances is obtained by means of the cavity method and the Belief Propagation equations, which allow for an exact solution on sparse networks. Our results reveal a rich and non-trivial phenomenology that deviates significantly from the predictions of fully connected models. Consistent with data from real ecosystems, which are characterized by sparse rather than dense interaction networks, we find strong deviations from Gaussianity in the distribution of abundances. In addition to the study of these deviations from Gaussianity, which are not related to multiple equilibria, we also identify a novel topological glass phase, present both at finite temperature, as shown here, and at T=0, as previously suggested in the literature. The peculiarity of this phase, which differs from the multiple-equilibria phase of fully connected networks, is its strong dependence on the presence of extinctions. These findings provide new insights into how network topology and disorder influence ecological networks, particularly emphasizing that sparsity is a crucial feature for accurately modeling real-world ecological phenomena.
The development of biologically interpretable and explainable models remains a key challenge in computational pathology, particularly for multistain immunohistochemistry (IHC) analysis. We present BioX-CPath, an explainable graph neural network architecture for whole slide image (WSI) classification that leverages both spatial and semantic features across multiple stains. At its core, BioX-CPath introduces a novel Stain-Aware Attention Pooling (SAAP) module that generates biologically meaningful, stain-aware patient embeddings. Our approach achieves state-of-the-art performance on both Rheumatoid Arthritis and Sjogren's Disease multistain datasets. Beyond performance metrics, BioX-CPath provides interpretable insights through stain attention scores, entropy measures, and stain interaction scores that permit measuring model alignment with known pathological mechanisms. This biological grounding, combined with strong classification performance, makes BioX-CPath particularly suitable for clinical applications where interpretability is key. Source code and documentation can be found at: https://github.com/AmayaGS/BioX-CPath.