2025-05-17 | Total: 10
Cell-free DNA (cfDNA) analysis offers a powerful, non-invasive approach to cancer diagnostics and monitoring by revealing tumor-specific genomic and epigenetic alterations. Here, we demonstrate the versatility of MeD-seq, a methylation-dependent sequencing assay, for comprehensive cfDNA analysis, including methylation profiling, chromosomal copy number (CN) alterations, and tumor fraction (TF) estimation. MeD-seq-derived CN profiles and TF estimates from 38 patients with colorectal cancer liver metastases (CRLM) and 5 ovarian cancer patients were highly comparable to those from shallow whole-genome sequencing (sWGS), validating our approach. These findings establish MeD-seq as a robust and cost-effective platform for detecting cancer-specific signals directly from plasma without prior tissue-based information. Future work should expand its application to other cancer types, solidifying MeD-seq as a versatile tool for minimally invasive cfDNA profiling.
Cancer is a leading global cause of mortality, responsible for nearly 10 million deaths annually, with breast, lung, colorectal, and prostate cancers among the most prevalent. Despite extensive research on individual cancer types, identifying shared molecular signatures could unlock pan-cancer diagnostic tools and therapeutic targets. This study leveraged RNA-seq data from The Cancer Genome Atlas (TCGA) to analyze four selected cancer types (SCTs): breast, lung, colorectal, and prostate cancer, employing a multi-step bioinformatics pipeline. Differentially expressed genes (DEGs) between tumors and normal tissues were identified and validated using an Elastic Net regression model. Weighted Gene Co-expression Network Analysis (WGCNA) revealed highly correlated gene modules linked to clinical traits, pinpointing 179 shared signature genes across the SCTs. Protein-protein interaction (PPI) network and clustering analyses further refined these to 26 hub genes, enriched in cancer hallmark pathways. Nine hub genes (KIF18B, RRM2, MYBL2, IQGAP3, TPX2, SLC7A11, RHPN1, HJURP, and SKA3) stood out due to their consistent upregulation in metastatic tumors (breast, colorectal, prostate) and their high expression across more than 18 cancer types, suggesting roles as oncogenes, prognostic markers, or therapeutic targets. The expression patterns of the hub genes were further validated across larger cancer patient cohorts of the SCTs, confirming relevance across multiple datasets, and their prognostic significance was assessed by their influence on overall survival (OS). Notably, these hub genes also correlated with immune-related functions, potentially influencing tumor microenvironment modulation. This integrative approach provides a strong framework for identifying potential cross-cancer biomarkers, advancing pan-cancer insights, and supporting improved diagnosis, prognosis, and therapy development.
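The WGCNA step above groups genes into modules of correlated expression. As a rough illustration of the underlying idea only (the study itself uses the WGCNA R package, not this code), the sketch below soft-thresholds a gene-gene correlation matrix and groups genes by connected components; all data and thresholds here are synthetic.

```python
import numpy as np

def coexpression_modules(expr, power=6, adj_cut=0.25):
    """Toy WGCNA-style module detection: soft-threshold the gene-gene
    correlation matrix, then group genes by connected components.
    expr: (genes x samples) expression matrix."""
    corr = np.corrcoef(expr)
    adj = np.abs(corr) ** power          # soft thresholding
    np.fill_diagonal(adj, 0.0)
    edges = adj > adj_cut                # keep only strong co-expression
    n = expr.shape[0]
    labels = -np.ones(n, dtype=int)
    module = 0
    for seed in range(n):                # connected components via DFS
        if labels[seed] >= 0:
            continue
        stack = [seed]
        labels[seed] = module
        while stack:
            g = stack.pop()
            for h in np.flatnonzero(edges[g]):
                if labels[h] < 0:
                    labels[h] = module
                    stack.append(h)
        module += 1
    return labels

# Synthetic data: two latent expression programs, five genes each
rng = np.random.default_rng(0)
base = rng.normal(size=(2, 20))
expr = np.vstack([base[0] + 0.1 * rng.normal(size=(5, 20)),
                  base[1] + 0.1 * rng.normal(size=(5, 20))])
labels = coexpression_modules(expr)
print(labels)  # genes 0-4 and genes 5-9 should fall into separate modules
```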
This study presents a case analysis of using AI systems as co-pilots in biological research, focusing on functional protein networks in invasive colorectal cancer. We used public proteomic data alongside ChatGPT, GitHub Copilot, and PaperQA to automate parts of the workflow, including literature review, code generation, and network analysis. While AI tools improved efficiency, they required expert guidance for tasks involving complex metadata, domain-specific parsing, and reproducibility. Our analysis identified cytoskeleton- and signaling-related networks in invasive cancer, aligning with known biology, but attempts to distinguish invasive from non-invasive cases produced inconclusive results. An attempt to conduct fully automated research using Agent Laboratory failed due to hallucinated data, misinterpretation of research goals, and instability as the complexity of the underlying LLM increased. These findings show that current AI can assist but not replace human researchers in complex biotech studies.
The cell cycle consists of four phases and impacts most cellular processes. In imaging assays, the cycle phase can be identified using dedicated cell-cycle markers. However, such markers occupy fluorescent channels that may be needed for other reporters. Here, we propose to address this limitation by inferring the phase from a widely used fluorescent reporter: SiR-DNA. Our method is based on a variational auto-encoder, enhanced with two auxiliary tasks: predicting the intensity of phase-specific markers and enforcing temporal consistency in the latent space. Our model is freely available, along with a new dataset comprising over 600,000 annotated HeLa Kyoto nuclear images.
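A composite objective of this shape (reconstruction + KL + two auxiliary terms) can be written down compactly. The numpy sketch below is an illustration of how the four terms might combine, not the published model; the loss forms (MSE for both auxiliary tasks) and unit weights are assumptions.

```python
import numpy as np

def combined_loss(x, x_hat, mu, logvar, marker_pred, marker_true,
                  z_seq, w_kl=1.0, w_marker=1.0, w_temp=1.0):
    """Sketch of a VAE objective with two auxiliary terms:
    marker-intensity regression and latent temporal smoothness.
    Loss forms and weights are illustrative assumptions."""
    recon = np.mean((x - x_hat) ** 2)                         # reconstruction
    kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))  # KL to N(0, I)
    marker = np.mean((marker_pred - marker_true) ** 2)        # aux task 1
    temp = np.mean((z_seq[1:] - z_seq[:-1]) ** 2)             # aux task 2
    return recon + w_kl * kl + w_marker * marker + w_temp * temp

# Degenerate check: perfect reconstruction, standard-normal posterior,
# exact marker prediction, constant latent trajectory -> zero loss
x = np.ones((4, 8))
z = np.zeros((4, 2))
loss = combined_loss(x, x, np.zeros((4, 2)), np.zeros((4, 2)),
                     np.ones(4), np.ones(4), z)
print(loss)
```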
Computational prediction of three-dimensional (3D) genome organization provides an alternative approach to overcome the cost and technical limitations of Hi-C experiments. However, current Hi-C prediction models are constrained by their narrow applicability to studying the impact of genetic variation on genome folding in specific cell lines, significantly restricting their biological utility. We present Hi-Compass, a generalizable deep learning model that accurately predicts chromatin organization across diverse biological contexts, from bulk to single-cell samples. Hi-Compass outperforms existing methods in prediction accuracy and is generalizable to unseen cell types through chromatin accessibility data, enabling broad applications in single-cell omics. Hi-Compass successfully resolves cell-type-specific 3D genome architectures in complex biological scenarios, including immune cell states, organ heterogeneity, and tissue spatial organization. Furthermore, Hi-Compass enables integrative analysis of single-cell multiome data, linking chromatin interaction dynamics to gene expression changes across cell clusters, and mapping disease variants to pathogenic genes. Hi-Compass also extends to spatial multi-omics data, generating spatially resolved Hi-C maps that reveal domain-specific chromatin interactions linked to spatial gene expression patterns.
Accurate characterization of genetic variation is fundamental to genomics. While long-read sequencing technologies promise to resolve complex genomic regions and improve variant detection, their application in polyploid and complex genomes remains challenging. Here, we systematically investigate the factors influencing variant calling accuracy using long reads. Using human trio data with known variants to simulate variable ploidy levels (diploid, tetraploid, hexaploid), we demonstrate that while variant sites can often be identified accurately, genotyping accuracy significantly decreases with increasing ploidy due to allelic dosage uncertainty. This highlights a specific challenge in assigning correct allele counts in polyploids even with high depth, separate from the initial variant discovery. We then assessed variant detection performance in genomes with varying complexity: the relatively simple diploid Fragaria vesca, the tetraploid Solanum tuberosum, and the highly repetitive diploid Zea mays. Our results reveal that overall variant calling accuracy correlates more strongly with inherent genome complexity (e.g., repeat content) than with ploidy level alone. Furthermore, we identify a critical mechanism impacting variant discovery: structural variations between the reference and sample genomes, particularly those containing repetitive elements, induce spurious read mapping. This leads to false variant calls, constituting a distinct and more dominant source of error than allelic-dosage uncertainty. Our findings underscore the multifaceted challenges in long-read variant analysis and highlight the need for ploidy-aware genotypers, complexity-informed variant callers, and bias-aware mapping strategies to fully realize the potential of long reads in diverse organisms.
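The allelic-dosage uncertainty described above can be illustrated with a simple binomial genotype-likelihood model: as ploidy grows, the expected alt-read fractions of adjacent dosages crowd together, so the same read counts discriminate between them less well. This is a generic likelihood sketch, not the study's genotyper; the error rate and read counts are illustrative.

```python
from math import comb

def dosage_likelihoods(alt_reads, depth, ploidy, err=0.01):
    """Normalized binomial likelihood of each alt-allele dosage
    (0..ploidy) given alt_reads alt-supporting reads out of depth reads."""
    liks = []
    for dosage in range(ploidy + 1):
        # expected alt-read fraction for this dosage, with a small
        # sequencing-error floor/ceiling
        f = min(max(dosage / ploidy, err), 1 - err)
        liks.append(comb(depth, alt_reads)
                    * f**alt_reads * (1 - f)**(depth - alt_reads))
    total = sum(liks)
    return [l / total for l in liks]

# Same 30x heterozygous-looking site at increasing ploidy: the best
# dosage call stays "balanced", but its posterior support shrinks.
for ploidy in (2, 4, 6):
    post = dosage_likelihoods(alt_reads=15, depth=30, ploidy=ploidy)
    best = max(range(len(post)), key=post.__getitem__)
    print(ploidy, best, round(post[best], 3))
```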
We introduce a visual analytics methodology for survival analysis and propose a framework that defines a reusable set of visualization and modeling components to support exploratory and hypothesis-driven biomarker discovery. Survival analysis, essential in biomedicine, evaluates patients' survival rates and the onset of medically relevant events given their clinical profiles and genetic predispositions. Existing approaches often require programming expertise or rely on inflexible analysis pipelines; the lack of advanced, user-friendly tools hinders problem solving, limits accessibility for biomedical researchers, and restricts interactive data exploration. Our methodology emphasizes functionality-driven design and modularity, akin to combining LEGO bricks to build tailored visual workflows. We (1) define a minimal set of reusable visualization and modeling components that support common survival analysis tasks, (2) implement interactive visualizations for discovering survival cohorts and their characteristic features, and (3) demonstrate integration within an existing visual analytics platform. We implemented the methodology as an open-source add-on to Orange Data Mining and validated it through use cases ranging from Kaplan–Meier estimation to biomarker discovery. The resulting framework illustrates how methodological design can drive intuitive, transparent, and effective survival analysis.
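One of the validated use cases is Kaplan–Meier estimation. As a self-contained illustration of that estimator (not the add-on's code), the product-limit form multiplies, at each distinct event time, the fraction of at-risk patients who survive it; censored patients leave the risk set without contributing an event. The cohort below is hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier (product-limit) survival estimate.
    times: event/censoring times; events: 1 = event observed, 0 = censored.
    Returns (time, S(t)) pairs at each distinct event time."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    s = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths = 0
        n = 0
        while i < len(order) and times[order[i]] == t:  # tie group at time t
            deaths += events[order[i]]
            n += 1
            i += 1
        if deaths:
            s *= 1 - deaths / at_risk    # survive this event time
            curve.append((t, s))
        at_risk -= n                     # events and censorings leave risk set
    return curve

# Hypothetical 5-patient cohort: events at t=1, 2, 3; censored at t=2, 5
print(kaplan_meier([1, 2, 2, 3, 5], [1, 1, 0, 1, 0]))
```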
Understanding protein functions facilitates the identification of the underlying causes of many diseases and guides the research for discovering new therapeutic targets and medications. With the advancement of high-throughput technologies, obtaining novel protein sequences has become routine. However, determining protein functions experimentally is prohibitively costly and labor-intensive. Therefore, it is crucial to develop computational methods for automatic protein function prediction. In this study, we propose a multi-modal deep learning architecture called ProtFun to predict protein functions. ProtFun integrates protein large language model (LLM) embeddings as node features in a protein family network. Employing graph attention networks (GAT) on this protein family network, ProtFun learns protein embeddings, which are integrated with protein signature representations from InterPro to train a protein function prediction model. We evaluated our architecture using three benchmark datasets. Our results showed that our proposed approach outperformed current state-of-the-art methods in most cases. An ablation study also highlighted the importance of different components of ProtFun. The data and source code of ProtFun are available at https://github.com/bozdaglab/ProtFun under the Creative Commons Attribution Non Commercial 4.0 International Public License.
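The core operation of a GAT layer like those ProtFun employs is attention-weighted neighbor aggregation. The numpy sketch below shows the generic single-head GAT formulation (masked attention logits, softmax over neighbors, weighted sum), not ProtFun's implementation; the graph and all weights are random placeholders.

```python
import numpy as np

def gat_layer(H, A, W, a_src, a_dst):
    """One single-head graph-attention aggregation step.
    H: (n, d_in) node features; A: (n, n) adjacency (1 = edge, incl. self-loops);
    W: (d_in, d_out) projection; a_src, a_dst: (d_out,) attention vectors."""
    Z = H @ W
    # attention logits e_ij = LeakyReLU(a_src . z_i + a_dst . z_j)
    e = (Z @ a_src)[:, None] + (Z @ a_dst)[None, :]
    e = np.where(e > 0, e, 0.2 * e)            # LeakyReLU
    e = np.where(A > 0, e, -np.inf)            # mask non-edges
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # softmax over neighbors
    return alpha @ Z                           # weighted neighbor aggregation

# Random 4-node chain graph with self-loops, 8-dim input, 3-dim output
rng = np.random.default_rng(1)
H = rng.normal(size=(4, 8))
A = np.eye(4) + np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)
out = gat_layer(H, A, rng.normal(size=(8, 3)),
                rng.normal(size=3), rng.normal(size=3))
print(out.shape)  # (4, 3)
```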
We developed a novel bioinformatics pipeline that reveals key transcription factors (TFs) regulating type 1 diabetes (T1D) transcriptomes by combining automated Python workflows, DESeq2-based differential expression analysis, motif enrichment, and genome-wide TF abundance profiling, a rare but powerful strategy that deepens our understanding of disease mechanisms. This approach not only uncovers TFs driving pathology but also quantifies TF abundance across multiple genomic loci, enabling rapid and precise monitoring of their occupancy. Analyzing 21 RNA sequencing datasets, including 13 from early-stage T1D patients and 8 matched controls, we identified nearly 6,000 differentially expressed genes; 1,900 met strict significance and fold-change criteria, and 211 were previously uncharacterized transcripts that distinguish T1D from healthy samples. Pathway analysis highlighted disruptions in beta-cell signaling, ALK-linked drug responses, neurodegenerative processes, and cytoskeletal organization. Upstream motif analysis revealed enrichment of Myc/Max, AP-1, SP-1, TATA-box, and NF-κB binding sites in upregulated genes, confirming their central role in the T1D transcriptome. By placing TFs at the core of our discovery platform, this work uncovers novel molecular drivers of T1D and identifies actionable biomarkers and therapeutic targets for personalized treatment.
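The "strict significance and fold-change criteria" step amounts to filtering a DESeq2-style results table on adjusted p-value and log2 fold change. The cutoffs (padj < 0.05, |log2FC| >= 1) and the gene rows below are hypothetical placeholders, since the abstract does not state the exact thresholds.

```python
# Toy DESeq2-style results: (gene, log2FoldChange, padj) -- hypothetical values
results = [
    ("INS",    -2.4, 1e-8),
    ("GCK",    -1.3, 4e-4),
    ("ACTB",    0.1, 0.90),
    ("MYC",     1.8, 2e-5),
    ("NOVEL1",  2.2, 3e-6),
]

PADJ_CUT = 0.05  # assumed adjusted-significance threshold
LFC_CUT = 1.0    # assumed absolute log2 fold-change threshold

# Keep genes passing both criteria (either direction of regulation)
strict = [g for g, lfc, padj in results
          if padj < PADJ_CUT and abs(lfc) >= LFC_CUT]
print(strict)  # ['INS', 'GCK', 'MYC', 'NOVEL1']
```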
DNA barcodes, which are short DNA strings, are regularly used as tags in pooled sequencing experiments to enable the identification of reads originating from the same sample. A crucial task in the subsequent analysis of pooled sequences is barcode calling, where one must identify the corresponding barcode for each read. This task is computationally challenging when the probability of synthesis and sequencing errors is high, as in photolithographic microarray synthesis. Identifying the most similar barcode for each read is a theoretically attractive solution for barcode calling. However, an all-to-all exact similarity calculation is practically infeasible for applications with millions of barcodes and billions of reads. Hence, several computational approaches for barcode calling have been proposed, but the challenge of developing an efficient and precise computational approach remains. Here, we propose a simple, yet highly effective new barcode calling approach that uses a filtering technique based on precomputed k-mer lists. We find that this approach has a slightly higher accuracy than the state-of-the-art approach, is more than 500 times faster, and allows barcode calling for one million barcodes and one billion reads per day on a server GPU.
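The filtering idea can be sketched as follows: precompute a k-mer-to-barcode index, shortlist the barcodes that share k-mers with a read, and run an exact edit-distance comparison only on that shortlist. This is a CPU toy model of the general technique, not the paper's GPU implementation; the barcodes, k, and fallback policy are illustrative choices.

```python
from collections import defaultdict

def kmers(s, k):
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def build_index(barcodes, k):
    """Precompute k-mer -> set of barcode ids."""
    index = defaultdict(set)
    for i, bc in enumerate(barcodes):
        for km in kmers(bc, k):
            index[km].add(i)
    return index

def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def call_barcode(read, barcodes, index, k, min_shared=1):
    """Shortlist barcodes sharing k-mers with the read, then return the
    shortlist entry with the smallest edit distance to the read."""
    counts = defaultdict(int)
    for km in kmers(read, k):
        for i in index.get(km, ()):
            counts[i] += 1
    candidates = [i for i, c in counts.items() if c >= min_shared]
    if not candidates:
        candidates = range(len(barcodes))  # fall back to a full scan
    return min(candidates, key=lambda i: edit_distance(read, barcodes[i]))

barcodes = ["ACGTACGTACGT", "TTGGCCAATTGG", "GATCGATCGATC"]
index = build_index(barcodes, k=5)
# Read with one substitution relative to the first barcode
print(call_barcode("ACGTACGAACGT", barcodes, index, k=5))  # 0
```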