Bioinformatics

2025-04-09 | Total: 16

#1 OpDetect: A convolutional and recurrent neural network classifier for precise and sensitive operon detection from RNA-seq data

Authors: Rezvan Karaji, Lourdes Pena-Castillo

An operon refers to a group of neighbouring genes belonging to one or more overlapping transcription units that are transcribed in the same direction and have at least one gene in common. Operons are a characteristic of prokaryotic genomes. Identifying which genes belong to the same operon facilitates understanding of gene function and regulation. There are several computational approaches for operon detection; however, many of them have been developed for a specific target bacterium or require information only available for a restricted number of bacterial species. Here, we introduce a general method, OpDetect, that directly utilizes RNA-sequencing (RNA-seq) reads as a signal over nucleotide bases in the genome. This representation enabled us to employ a convolutional and recurrent deep neural network architecture, which demonstrated superior performance in terms of recall, F1-score and AUROC compared to previous approaches. Additionally, OpDetect showcases species-agnostic capabilities, successfully detecting operons in a wide range of bacterial species and even in Caenorhabditis elegans, one of the few eukaryotic organisms known to have operons. OpDetect is available at https://github.com/BioinformaticsLabAtMUN/OpDetect.

Subject: Bioinformatics

Publish: 2025-03-28


#2 WIMOAD: Weighted Integration of Multi-Omics data for Alzheimer’s Disease (AD) Diagnosis

Authors: Hanyu Xiao, Jieqiong Wang, Shibiao Wan

As the most common subtype of dementia, Alzheimer’s disease (AD) is characterized by a progressive decline in cognitive functions, especially in memory, thinking, and reasoning ability. Early diagnosis and interventions enable the implementation of measures to reduce or slow further regression of the disease, preventing individuals from severe brain function decline. The current framework of AD diagnosis depends on A/T/(N) biomarker detection from cerebrospinal fluid or brain imaging data, which is invasive and expensive during the data acquisition process. Moreover, the pathophysiological changes of AD accumulate in amino acids, metabolism, neuroinflammation, etc., resulting in heterogeneity in newly registered patients. Recently, next generation sequencing (NGS) technologies have been found to be a non-invasive, efficient, and less costly alternative for AD screening. However, most existing studies rely on single omics only. To address these concerns, we introduce WIMOAD, a weighted integration of multi-omics data for AD diagnosis. WIMOAD synergistically leverages specialized classifiers for patients’ paired gene expression and methylation data for multi-stage classification. The resulting scores were then stacked with MLP-based meta-models for performance improvement. The prediction results of two distinct meta-models were integrated with optimized weights for the final decision-making of the model. WIMOAD achieves significantly higher performance than using single omics alone in the classification tasks. The model also outperformed most existing approaches, highlighting its ability to effectively discern intricate patterns in multi-omics data and their correlations with clinical diagnosis results.
In addition, WIMOAD stands out as a biologically interpretable model by leveraging SHapley Additive exPlanations (SHAP) to elucidate the contributions of each gene from each omics to the model output. We believe WIMOAD is a very promising tool for accurate AD diagnosis and effective biomarker discovery across different progression stages, which eventually will have consequential impacts on early treatment intervention and personalized therapy design for AD.
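The weighted integration of two meta-model outputs described above can be illustrated with a minimal sketch. It assumes each meta-model emits one probability per patient and that the weight is chosen by a simple grid search on accuracy; the function names, the grid search, and the toy scores are illustrative assumptions, not WIMOAD's actual implementation or data.

```python
def fuse(p_expr, p_meth, w):
    """Weighted average of two per-sample probabilities (w applies to expression)."""
    return [w * a + (1.0 - w) * b for a, b in zip(p_expr, p_meth)]


def accuracy(probs, labels, cutoff=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum((p >= cutoff) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)


def best_weight(p_expr, p_meth, labels, steps=20):
    """Grid-search w in [0, 1]; on ties the smallest w is kept."""
    best_w, best_acc = 0.0, -1.0
    for i in range(steps + 1):
        w = i / steps
        acc = accuracy(fuse(p_expr, p_meth, w), labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc


# Toy scores (hypothetical): each single-omics model misclassifies some patients.
labels = [1, 1, 0, 0]
p_expr = [0.9, 0.4, 0.2, 0.6]   # expression-based meta-model
p_meth = [0.6, 0.7, 0.55, 0.1]  # methylation-based meta-model
w, acc = best_weight(p_expr, p_meth, labels)
```

In this toy case neither score alone classifies all four samples correctly, while an intermediate weight does. In practice the weight would be tuned on held-out data rather than on the samples being evaluated.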

Subject: Bioinformatics

Publish: 2024-09-27


#3 NEAR: Neural Embeddings for Amino acid Relationships

Authors: Daniel Olson, Thomas Colligan, Daphne Demekas, Jack W. Roddy, Ken Youens-Clark, Travis J. Wheeler

Protein language models (PLMs) have recently demonstrated potential to supplant classical protein database search methods based on sequence alignment, but are slower than common alignment-based tools and appear to be prone to a high rate of false labeling. Here, we present NEAR, a method based on neural representation learning that is designed to improve both speed and accuracy of search for likely homologs in a large protein sequence database. NEAR’s ResNet embedding model is trained using contrastive learning guided by trusted sequence alignments. It computes per-residue embeddings for target and query protein sequences, and identifies alignment candidates with a pipeline consisting of residue-level k-NN search and a simple neighbor aggregation scheme. Tests on a benchmark consisting of trusted remote homologs and randomly shuffled decoy sequences reveal that NEAR substantially improves accuracy relative to state-of-the-art PLMs, with lower memory requirements and faster embedding and search speed. While these results suggest that the NEAR model may be useful for standalone homology detection with increased sensitivity over standard alignment-based methods, in this manuscript we focus on a more straightforward analysis of the model’s value as a high-speed pre-filter for sensitive annotation. In that context, NEAR is at least 5x faster than the pre-filter currently used in the widely-used profile hidden Markov model (pHMM) search tool HMMER3, and also outperforms the pre-filter used in our fast pHMM tool, nail.

Subject: Bioinformatics

Publish: 2025-01-24


#4 In Silico Evaluation and Therapeutic Targeting of LVDD9B Protein for WSSV Inhibition: Molecular and Ecological Insights for Aquaculture Solutions

Authors: Md. Iftehimul, Neaz A. Hasan, Mst. Farzana Akter, Md. Arju Hossain, Sajia Afrin Tima, Amirul Kabir, Prottay Choudhury, Apurbo Bhowmick, Sakib Anzum Pranto, Ali Mohamod Wasaf Hasan, Siddique Akber Ansari, Md Habibur Rahman

Background: This study aimed to investigate the structural dynamics, binding interactions, stability, pharmacokinetics, ecological risks, and bioactivity of the shrimp receptor protein LVDD9B to identify potential therapeutic candidates against White Spot Syndrome Virus (WSSV). Methods: The 3D structure of the LVDD9B protein was predicted using SWISS-MODEL and validated with ProSA and Ramachandran plots. Protein-protein docking between LVDD9B and VP26 (a WSSV protein) was performed using the HADDOCK 2.4 server. Molecular docking, dynamics simulations, binding free energy calculations, principal component analysis (PCA), and electrostatic and vibrational frequency analyses were used to evaluate the binding affinity, stability, and polarity of the complexes. Results: The 128-amino-acid LVDD9B protein was predicted to be predominantly cytoplasmic, stable, and hydrophilic, and structural analysis identified key secondary structures and a conserved chitin-binding site. Docking studies revealed strong interactions between LVDD9B and VP26, supported by hydrogen bonds and salt bridges. Molecular dynamics simulations demonstrated stable complexes with fluctuating RMSD values, and MM/GBSA calculations indicated favorable binding free energies. Pharmacokinetic analysis highlighted promising bioavailability and drug-like properties for Luteolin and Quercetin from Cuscuta reflexa, while ecological assessment identified Cosmosiin as the least hazardous, with Quercetin and Luteolin showing higher toxicity. PCA revealed stable protein-ligand complexes, with flexibility in the Apo form. Isorhoifolin exhibited the lowest internal energy (-2099.4722 Hartree) and the highest dipole moment (8.1833 Debye). Frontier orbital analysis showed HOMO-LUMO gaps (4.05 to 4.34 eV) influencing reactivity, while MEP and vibrational frequency analyses supported compound stability and bioactivity.
Conclusions: This study explores LVDD9B protein structural and interaction dynamics for developing antiviral therapy against WSSV, highlighting therapeutic potential of Cosmosiin, Isorhoifolin, Quercetin and Luteolin based on their pharmacokinetic and ecological profiles.

Subject: Bioinformatics

Publish: 2025-04-09


#5 Quantifying Uncertainty in Phasor-Based Time-Domain Fluorescence Lifetime Imaging Microscopy

Authors: Qinyi Chen, Jongchan Park, Shuqi Mu, Liang Gao

The phasor approach to time-domain fluorescence lifetime imaging microscopy (FLIM) offers a powerful, fit-free method for analyzing complex fluorescence decay signals. However, its quantitative accuracy is fundamentally limited by noise, particularly photon shot noise, which introduces variability and bias in lifetime estimation and fluorophore unmixing. In this study, we present a theoretical uncertainty model for phasor-based time-domain FLIM that analytically captures the propagation of shot noise and quantifies its impact on phasor coordinates and fluorophore weight estimation. We validate the model using Monte Carlo simulations and experimental data acquired from standard fluorescent dyes and biological tissue samples. Our model improves the overall reliability and efficiency of phasor-based time-domain FLIM, particularly in photon-limited imaging applications.
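How shot noise spreads phasor coordinates can be illustrated with a small Monte Carlo sketch. It assumes an ideal mono-exponential decay, no instrument response or background, and phasor coordinates computed directly from photon arrival times; the function names and parameter values are illustrative assumptions, and this is not the authors' analytical model.

```python
import math
import random
import statistics


def ideal_phasor(tau_ns, freq_mhz):
    """Noise-free phasor (g, s) of a mono-exponential decay with lifetime tau."""
    x = 2.0 * math.pi * freq_mhz * 1e-3 * tau_ns  # omega * tau, with omega in rad/ns
    return 1.0 / (1.0 + x * x), x / (1.0 + x * x)


def phasor_from_arrivals(times_ns, freq_mhz):
    """Empirical phasor: mean cosine and sine of the photon arrival phases."""
    omega = 2.0 * math.pi * freq_mhz * 1e-3  # rad/ns
    n = len(times_ns)
    g = sum(math.cos(omega * t) for t in times_ns) / n
    s = sum(math.sin(omega * t) for t in times_ns) / n
    return g, s


def simulate_g_spread(tau_ns, freq_mhz, n_photons, n_trials, seed=0):
    """Mean and standard deviation of g over repeated photon-limited measurements."""
    rng = random.Random(seed)
    gs = []
    for _ in range(n_trials):
        times = [rng.expovariate(1.0 / tau_ns) for _ in range(n_photons)]
        g, _ = phasor_from_arrivals(times, freq_mhz)
        gs.append(g)
    return statistics.mean(gs), statistics.stdev(gs)


if __name__ == "__main__":
    for photons in (100, 400, 1600):
        mean_g, sd_g = simulate_g_spread(2.5, 80.0, photons, 100, seed=photons)
        print(photons, round(mean_g, 4), round(sd_g, 4))
```

Under these assumptions the spread of g should shrink roughly as the inverse square root of the photon count, which is the kind of behavior an analytical shot-noise model can be checked against.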

Subject: Bioinformatics

Publish: 2025-04-09


#6 SpaceBF: Spatial coexpression analysis using Bayesian Fused approaches in spatial omics datasets

Authors: Souvik Seal, Brian Neelon

Advancements in spatial omics technologies have enabled the measurement of expression profiles of different molecules, such as genes (using spatial transcriptomics), and peptides, lipids, or N-glycans (using mass spectrometry imaging), across thousands of spatial locations within a tissue. While identifying molecules with spatially variable expression is a well-studied statistical problem, robust methodologies for detecting spatially varying co-expression between molecule pairs remain limited. To address this gap, we introduce a Bayesian fused modeling framework for estimating molecular co-expression at both local (location-specific) and global (tissue-wide) levels, offering a refined understanding of cell-cell communication (CCC) mediated through ligand-receptor and other molecular interactions. Through extensive simulations, we demonstrate that our approach, termed SpaceBF, achieves superior specificity and precision compared to existing methods that predominantly rely on geospatial metrics such as bivariate Moran's I and Lee's L. Applying our framework to real spatial transcriptomics datasets, we uncover novel biological insights into CCC patterns across different cancer types.

Subject: Bioinformatics

Publish: 2025-04-03


#7 Robin: An Advanced Tool for Comparative Loop Caller Result Analysis Leveraging Large Language Models

Authors: H. M. A. Mohit Chowdhury, Mattie Fuller, Oluwatosin E Oluwadare

There has been significant interest in genomics research, leading to the development of numerous new methods. One notable area of progress is in chromosome looping detection algorithms (also known as loop callers). However, despite these advancements, there is no available platform to analyze, compare, or benchmark the results of current tools on the go. Developing such a platform is crucial to accelerate research and ensure the reliability and effectiveness of new methods in the field. Hence, in this work, we propose Robin, an advanced ready-to-go platform for comparative loop caller result analysis leveraging Large Language Models (LLMs). Robin is a web server designed to analyze loop caller results, offering a comprehensive range of analysis metrics such as recovery and overlap. It is integrated with HiGlass and incorporates LLMs to enable users to generate plots simply by providing instructions. Overall, Robin is a robust and comprehensive loop caller result analysis and visualization tool. It is publicly accessible at http://hicrobin.online, with comprehensive documentation available at http://documentation.hicrobin.online/

Subject: Bioinformatics

Publish: 2025-04-09


#8 Self-contrastive learning enables interference-resilient and generalizable fluorescence microscopy signal detection without interference modeling

Authors: Fengdi Zhang, Ruqi Huang, Meiqian Xin, Haoran Meng, Danheng Gao, Ying Fu, Juntao Gao, Xiangyang Ji

Every weak signal in fluorescence microscopy may contain critical biological information. However, the interference resilience required to detect such signals has traditionally relied on task-specific interference modeling, which limits generalizability. Here, we present a self-contrastive learning-based signal detection solution that achieves interference resilience without the need for interference modeling, thereby offering high generalizability. The method, DEPAF (deep pattern fitting), is a module that contrasts asynchronously generated data views from asymmetric model paths to extract signals from interference, while incorporating highly parallel signal recognition and localization in the process. In benchmark tests, we show that DEPAF improves the detection rate of ultra-high-density signals under low signal-to-noise ratio conditions by an order of magnitude. It is also compatible with, and substantially enhances the performance of, various imaging techniques, such as super-resolution imaging, spatial transcriptomic imaging, and two-photon calcium imaging. DEPAF is expected to advance signal-centric fluorescence microscopy techniques and inspire further advancements, especially in the era of image-based multi-omics.

Subject: Bioinformatics

Publish: 2025-04-09


#9 A global P-Process map to avoid P-value abuse

Authors: Jing Xu, Siyu Wei, Junxian Tao, Chen Sun, Haiyan Chen, Lian Duan, Zhenwei Shang, Wenhua Lyu, Hongchao Lyu, Mingming Zhang, Yongshuai Jiang

P-value abuse (P-abuse) is a serious problem in difference identification analysis of data, and how to avoid it is a huge challenge. Here, we evolve the P-value (a single value) into the P-Process (a global landscape of P-values under different sample sizes) to help researchers correctly recognize and use the P-value. We observed that -ln(P-Process) after rotation has a morphology very similar to that of a Wiener process (or Brownian motion). Based on this property, for any sample size, we estimated the 95% fluctuation range of the P-value and further estimated how many samples (N95(α)) are needed to ensure that 95% of P-values are less than the given significance level α. Tests showed that the estimation performs well with only a small number of samples in each group. For broader accessibility, a free web service, P-Process Map, is available online to show the whole landscape of the P-Process. At the end of the article, we explain 10 of the most typical P-abuse problems, which can be easily avoided by using the P-Process. This "rethinking" of the P-value from a higher position, which is profoundly different from the way it has been viewed for the past century, would yield a new era, the 2P era, in which hypothesis testing is strongly required to evolve from P-value to P-Process.
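The idea of viewing a P-value as a curve over sample size can be sketched as follows. This assumes a two-group comparison, a two-sided test based on a Welch-type statistic with a normal approximation, and simulated Gaussian data; the function names and the choice of test are illustrative assumptions, not the P-Process Map implementation.

```python
import math
import random
import statistics


def two_sample_p(x, y):
    """Two-sided p-value from a Welch-type z statistic (normal approximation)."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    z = (statistics.fmean(x) - statistics.fmean(y)) / se
    return 2.0 * (1.0 - statistics.NormalDist().cdf(abs(z)))


def p_process(x, y, sizes):
    """P-value recomputed on the first n samples of each group, for each n."""
    return [(n, two_sample_p(x[:n], y[:n])) for n in sizes]


# Simulated groups with a true mean difference of 0.8 standard deviations.
rng = random.Random(42)
group_a = [rng.gauss(0.8, 1.0) for _ in range(200)]
group_b = [rng.gauss(0.0, 1.0) for _ in range(200)]
curve = p_process(group_a, group_b, range(10, 201, 10))

if __name__ == "__main__":
    for n, p in curve:
        print(n, f"{p:.3g}")
```

Printing the curve shows how a single reported P-value depends on where along the sample-size axis the analysis happened to stop.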

Subject: Bioinformatics

Publish: 2025-04-09


#10 NetSyn: genomic context exploration of protein families

Authors: Mark Stam, Jordan Langlois, Céline Chevalier, Guillaume Reboul, Karine Bastard, Claudine Médigue, David Vallenet

Background: The growing availability of large genomic datasets presents an opportunity to discover novel metabolic pathways and enzymatic reactions profitable for industrial or synthetic biological applications. Efforts to identify new enzyme functions in this substantial number of sequences cannot be achieved without the help of bioinformatics tools and the development of new strategies. The classical way to assign a function to a gene uses sequence similarity. However, another way is to mine databases to identify conserved gene clusters (i.e. syntenies) as, in prokaryotic genomes, genes involved in the same pathway are frequently encoded in a single locus with an operonic organisation. This Genomic Context (GC) conservation is considered a reliable indicator of functional relationships, and thus is a promising approach to improve gene function prediction. Methods: Here we present NetSyn (Network Synteny), a tool that aims to cluster protein sequences according to the similarity of their genomic context rather than their sequence similarity. Starting from a set of protein sequences of interest, NetSyn retrieves neighbouring genes from the corresponding genomes as well as their protein sequences. Homologous protein families are then computed to measure synteny conservation between each pair of input sequences using a GC score. A network is then created where nodes represent the input proteins and edges indicate that two proteins share a common GC. The weight of the edges corresponds to the synteny conservation score. The network is then partitioned into clusters of proteins sharing a high degree of synteny conservation. Results: As a proof of concept, we used NetSyn on two different datasets. The first one is made of homologous sequences of an enzyme family (the BKACE family, previously named DUF849) to divide it into sub-families of specific activities.
NetSyn was able to go further by providing additional subfamilies in addition to those previously published. The second dataset corresponds to a set of non-homologous proteins consisting of different Glycosyl Hydrolases (GH) with the aim of interconnecting them and finding conserved operon-like genomic structures. NetSyn was able to detect the locus of Cellvibrio japonicus for the degradation of xyloglucan. It contains three non-homologous GH and was found conserved in fourteen bacterial genomes. Discussion: NetSyn is able to cluster proteins according to their genomic context, which is a way to make functional links between proteins without relying solely on their sequence similarity. We showed that NetSyn is efficient in exploring large protein families to define iso-functional groups. It can also highlight functional interactions between proteins from different families and predict new conserved genomic structures that have not yet been experimentally characterised. NetSyn can also be useful in pinpointing mis-annotations that have been propagated in databases and in suggesting annotations on proteins currently annotated as “unknown”. NetSyn is freely available at https://github.com/labgem/netsyn.
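The network construction described in the Methods can be sketched in a few lines. This toy version assumes each input protein comes with the list of protein-family identifiers found among its neighbouring genes, scores a pair by the number of shared families, and clusters with connected components; the scoring rule, the threshold, the clustering method, and all names are simplifying assumptions, not NetSyn's actual GC score or partitioning algorithm.

```python
from itertools import combinations


def synteny_score(context_a, context_b):
    """Toy GC score: number of protein families shared by two genomic contexts."""
    return len(set(context_a) & set(context_b))


def build_network(contexts, min_score=2):
    """Weighted edges between input proteins whose contexts share enough families."""
    edges = {}
    for a, b in combinations(sorted(contexts), 2):
        score = synteny_score(contexts[a], contexts[b])
        if score >= min_score:
            edges[(a, b)] = score
    return edges


def clusters(nodes, edges):
    """Connected components via union-find (a stand-in for real graph clustering)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical input: protein -> families of its neighbouring genes.
contexts = {
    "P1": ["famA", "famB", "famC"],
    "P2": ["famA", "famB", "famD"],
    "P3": ["famB", "famA", "famE"],
    "P4": ["famX", "famY"],
    "P5": ["famX", "famY", "famZ"],
    "P6": ["famQ"],
}
edges = build_network(contexts)
groups = clusters(contexts, edges)
```

Here `groups` separates P1 to P3 from P4 and P5 and leaves P6 as a singleton, even though nothing about the input proteins' own sequences was used.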

Subject: Bioinformatics

Publish: 2023-02-15


#11 LucaOne: Generalized Biological Foundation Model with Unified Nucleic Acid and Protein Language

Authors: Yong He, Pan Fang, Yongtao Shan, Yuanfei Pan, Yanhong Wei, Yichang Chen, Yihao Chen, Yi Liu, Zhenyu Zeng, Zhan Zhou, Feng Zhu, Edward C. Holmes, Jieping Ye, Jun Li, Yuelong Shu, Mang Shi, Zhaorong Li

In recent years, significant advancements have been observed in the domain of Natural Language Processing (NLP) with the introduction of pre-trained foundational models, paving the way for utilizing similar AI technologies to interpret the language of biology. In this research, we introduce “LucaOne”, a novel pre-trained foundational model designed to integratively learn from the genetic and proteomic languages, encapsulating data from 169,861 species encompassing DNA, RNA, and proteins. This work illuminates the potential for creating a biological language model aimed at universal bioinformatics application. Remarkably, through few-shot learning, this model efficiently learns the central dogma of molecular biology and demonstrably outperforms competing models. Furthermore, in tasks requiring inputs of DNA, RNA, proteins, or a combination thereof, LucaOne exceeds the state-of-the-art performance using a streamlined downstream architecture, thereby providing empirical evidence and innovative perspectives on the potential of foundational models to comprehend complex biological systems.

Subject: Bioinformatics

Publish: 2024-05-14


#12 Accurate Somatic SV detection via sequence graph model-based local pan-genome optimization

Authors: Kailing Tu, Qilin Zhang, Yang Li, Yucong Li, Lanfang Yuan, Jie Tang, Lin Xia, Jing Wang, Wei Huang, Dan Xie

Somatic structural variations (somatic SVs) are critical genomic alterations with significant implications in cancer genomics. Although long-read sequencing (LRS) theoretically provides optimal resolution for detecting these variants due to its ability to span large genomic segments, current LRS-based methods, which are derived from short-read-based somatic SV detection algorithms, mainly rely on split-read information. The high error rate of long-read sequencing and the errors introduced by the seed-and-chaining strategy of mainstream alignment algorithms affect the accuracy of these split-reads, so precise detection of somatic SVs remains a challenge. To address this issue, we propose the TDScope algorithm, which uses the complete sequence information of local genomic regions provided by long-read sequencing to construct a local graph genome and combines random forest technology to achieve precise detection of somatic structural variations. TDScope outperforms state-of-the-art somatic SV detection methods on paired long-read whole-genome sequencing (WGS) benchmark cell lines, with an average F1-score improvement of 20%. It also demonstrates superior performance in detecting somatic SVs and resolving heterogeneous genomes in tandem repeat-like simulated somatic SV datasets. We also provide the ScopeVIZ tool to offer users visualization evidence of local graph genomes and somatic SV sequences. All code implementations are publicly available on GitHub (https://github.com/Goatofmountain/TDScope).

Subject: Bioinformatics

Publish: 2025-02-14


#13 The South American MicroBiome Archive (saMBA): Enriching the healthy microbiome concept by evaluating uniqueness and biodiversity of neglected populations

Authors: Benjamin Valderrama, Paulina Calderon-Romero, Thomaz F.S. Bastiaanssen, Aonghus Lavelle, Gerard Clarke, John F Cryan

The composition and function of the human gut microbiome have been linked to multiple health outcomes across all world regions, often with region-specific associations. Unfortunately, the extent to which microbiomes from different populations are characterised is limited by their economic resources. Over 70% of the sequenced human microbiomes come from analyses of European and North American populations, skewing our understanding by focusing excessively on just 15% of the global population. Thus, entire continents rely on results from research conducted in wealthier countries whose main findings are unlikely to generalize across other world regions. Moreover, statistical models perform poorly when applied to minorities, a blind spot with serious consequences in biomedicine, and which can only be addressed by analysing microbiome data from currently neglected areas. To address this problem, we created saMBA, the largest archive of gut microbiomes from South America, one of the world's most biodiverse regions in terms of the gut microbiome of its inhabitants, yet the one with the fewest samples. saMBA includes 33 gut microbiome studies, ~73% of which were incorporated in a microbiome archive for the first time. By leveraging this resource, we uncovered a high biodiversity within, and uniqueness between, gut microbiomes across the continent, expanding the concept of the healthy microbiome to be more globally representative. Additionally, our results highlight that the gut microbiome biodiversity of this region remains far from fully characterized. We demonstrate how saMBA can guide new sampling efforts to better capture this diversity. Finally, the code deployed to build saMBA is compatible with that of a previous global compendium and is openly available to researchers from other underrepresented regions, fostering the inclusion of other neglected populations to accelerate microbiome research globally.

Subject: Bioinformatics

Publish: 2025-04-09


#14 CoExpPhylo - A Novel Pipeline for Biosynthesis Gene Discovery

Authors: Nele Gruenig, Boas Pucker

Background: The rapid advancement of sequencing technologies has drastically increased the availability of plant genomic and transcriptomic data, shifting the challenge from data generation to functional interpretation. Identifying genes involved in specialized metabolism remains difficult. While coexpression analysis is a widely used approach to identify genes acting in the same pathway or process, it has limitations, particularly in distinguishing genes coexpressed due to shared regulatory triggers from those directly involved in the same pathway. To enhance functional predictions, integrating phylogenetic analysis provides an additional layer of confidence by considering evolutionary conservation. Here, we introduce CoExpPhylo, a computational pipeline that systematically combines coexpression analysis and phylogenetics to identify candidate genes involved in specialized biosynthetic pathways across multiple species based on one or more bait gene candidates. Results: CoExpPhylo systematically integrates coexpression information and phylogenetic signals to identify candidate genes involved in specialized biosynthetic pathways. The pipeline consists of multiple computational steps: (1) species-specific coexpression analysis, (2) local sequence alignment to identify orthologs, (3) clustering of candidate genes into Orthologous Coexpressed Groups (OCGs), (4) functional annotation, (5) global sequence alignment, (6) phylogenetic tree generation, and optionally (7) visualization. The workflow is highly customizable, allowing users to adjust correlation thresholds, filtering parameters, and annotation sources. Benchmarking CoExpPhylo on multiple pathways, including anthocyanin, proanthocyanidin, and flavonol biosynthesis, confirmed its ability to recover known genes while also suggesting novel candidates. Conclusion: CoExpPhylo provides a systematic framework for identifying candidate genes involved in specialized metabolism.
By integrating coexpression data with phylogenetic clustering, it facilitates the discovery of both conserved and lineage-specific genes. The resulting OCGs offer a strong foundation for further experimental validation, bridging the gap between computational predictions and functional characterization. Future improvements, such as incorporating multi-species reference databases and refining clustering for large gene families, could further enhance its resolution. Overall, CoExpPhylo represents a valuable tool for accelerating pathway elucidation and advancing our understanding of specialized metabolism in plants.
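Step (1) of the pipeline, coexpression analysis against a bait gene with an adjustable correlation threshold, can be sketched as below. The use of Pearson correlation, the threshold value, the gene names, and the expression values are all illustrative assumptions; the pipeline's actual correlation measure and defaults may differ.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def coexpressed_with_bait(expression, bait, min_r=0.7):
    """Genes whose profile correlates with the bait at or above min_r."""
    bait_profile = expression[bait]
    hits = {}
    for gene, profile in expression.items():
        if gene == bait:
            continue
        r = pearson(bait_profile, profile)
        if r >= min_r:
            hits[gene] = r
    return hits


# Hypothetical expression values across five samples of one species.
expression = {
    "CHS": [1.0, 2.0, 4.0, 8.0, 16.0],      # bait gene (name used for illustration)
    "geneA": [2.0, 4.1, 7.9, 16.2, 31.0],   # tracks the bait
    "geneB": [9.0, 7.0, 5.0, 3.0, 1.0],     # anti-correlated
    "geneC": [5.0, 5.0, 5.0, 5.0, 5.0],     # flat profile
}
hits = coexpressed_with_bait(expression, "CHS")
```

Only `geneA` passes the threshold here. In a multi-species run, the per-species hit lists would then feed the ortholog search and clustering steps listed in the abstract.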

Subject: Bioinformatics

Publish: 2025-04-09


#15 CDState: an unsupervised approach to predict malignant cell heterogeneity in tumor bulk RNA-sequencing data

Authors: Agnieszka Kraft, Josephine Yates, Valentina Boeva

Intratumor transcriptional heterogeneity (ITTH) presents a major challenge in cancer treatment, particularly due to the limited understanding of the diverse malignant cell populations and their relationship to therapy resistance. While single-cell sequencing has provided valuable insights into tumor composition, its high cost and technical complexity limit its use for large-scale tumor screening. In contrast, several databases collecting bulk RNA sequencing data from multiple samples across various cancer types are available and could be used to profile ITTH. Several deconvolution approaches have been developed to infer cellular composition from such data. However, most of these methods rely on predefined markers or reference datasets, so their performance is limited by the quality of the reference data used. On the other hand, unsupervised approaches do not face such limitations, but existing methods have not been specifically adapted to characterize malignant cell states, focusing instead on general cell types. To address these gaps, we introduce CDState, an unsupervised method for inferring malignant cell subpopulations from bulk RNA-seq data. CDState utilizes a Nonnegative Matrix Factorization (NMF) model improved with sum-to-one constraints and a cosine similarity-based optimization to deconvolve bulk gene expression into distinct cell state-specific profiles, and estimate the abundance of each state across tumor samples. We validate CDState using bulkified single-cell RNA-seq data from five cancer types, showing that it outperforms existing unsupervised deconvolution methods in both cell state proportions and gene expression estimation. Applying CDState to 33 cancer types from the TCGA, we identified recurrent malignant cell programs, including epithelial-mesenchymal transition (EMT) and hypoxia as main drivers of tumor transcriptional heterogeneity.
We further link the identified malignant states to patient clinical features, revealing states associated with worse patient prognosis. Finally, we find that alterations in genes such as KRAS or TP53, whose copy number or mutation status strongly correlate with specific malignant states, may play a key role in driving these cellular states. Overall, CDState provides a powerful and accessible approach to resolving intratumor heterogeneity using bulk RNA-seq data and emerges as a promising tool for advancing our understanding of malignant cell heterogeneity.

Subject: Bioinformatics

Publish: 2025-03-07


#16 DeorphaNN: Virtual screening of GPCR peptide agonists using AlphaFold-predicted active state complexes and deep learning embeddings

Authors: Larissa Ferguson, Sébastien Ouellet, Elke Vandewyer, Christopher Wang, Zaw Wunna, Tony K.Y. Lim, William R. Schafer, Isabel Beets

G protein-coupled receptors (GPCRs) are important cell surface receptors involved in numerous physiological processes. Although peptides are the cognate ligands for many of these receptors, identifying endogenous peptide agonists for GPCRs remains a significant challenge. Deep learning-based protein structure prediction algorithms, such as AlphaFold (AF), have utility in non-structural tasks including protein-protein interaction prediction, suggesting they may be useful for predicting GPCR-peptide agonist interactions. Leveraging a dataset of experimentally validated agonist and non-agonist GPCR-peptide interactions from Caenorhabditis elegans, we show that AF-Multimer confidence metrics enable partial discrimination between GPCR-agonist and non-agonist complexes. To better reflect agonist-bound conformations, AF-Multistate templates are used to produce active-state GPCR-peptide complexes, improving discriminatory power. Embeddings from the final hidden layer of AF-Multimer's neural network, which capture structural and interaction patterns, were used to train random forest classifiers to assess whether AF-Multimer protein representations can distinguish agonist from non-agonist complexes. Feature performance analysis reveals that AF-Multimer's pair representations outperform single representations, with distinct subregions of the pair representation providing complementary predictive signals. Building on these findings, we developed DeorphaNN, a graph neural network that integrates active-state GPCR-peptide structural predictions, interatomic interactions, and pair representations to predict agonist identity. DeorphaNN's predictive utility generalizes to datasets outside of C. elegans, including annelids and humans, and experimental validation of predicted agonists for two orphan GPCRs uncovers their cognate agonists.
Our approach offers a resource to accelerate GPCR deorphanization through the in silico identification of receptor-agonist candidates for AI-guided experimental validation.

Subject: Bioinformatics

Publish: 2025-03-20