2025-12-08 | | Total: 20
This study introduces a novel epidemiological model that expands upon the Kermack-McKendrick model by incorporating the age of infection and reinfection. By including infection age, we can classify participants, which enables a more targeted analysis within the modeling framework. The reinfection term addresses the real-world occurrences of secondary or recurrent viral infections. In the theoretical part, we apply the contraction mapping principle, the dominated convergence theorem, and the properties of Volterra integral equations to derive analytical expressions for the number of newly infected individuals denoted by $N(t)$. Then, we establish a Volterra integral equation for $N(t)$ and study its initial conditions for both a single cohort and multiple cohorts. From this equation, we derive a method for identifying the effective reproduction number, denoted as $\mathcal{R}(t)$. In the practical aspect, we present two distinct methods and separately apply them to analyze the daily new infection cases from the 2003 SARS outbreak in Singapore and the cumulative number of deaths from the COVID-19 epidemic in China. This work effectively bridges theoretical epidemiology and computational modeling, providing a robust framework for analyzing infection dynamics influenced by infection-age-structured transmission and reinfection mechanisms.
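As a rough illustration of how an effective reproduction number can be read off a renewal-type (discrete Volterra) equation, the sketch below assumes the simplified form $N(t) = \mathcal{R}(t)\sum_{a \ge 1} w(a)\,N(t-a)$. This is the textbook renewal estimator, not the paper's model with reinfection. The infection-age kernel `weights`, the function names, and the synthetic data are all hypothetical.

```python
import math


def _normalize(weights):
    """Scale the infection-age kernel so that it sums to one."""
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def infection_pressure(cases, weights, t):
    """Discrete convolution sum_{a=1}^{min(t, len(w))} w(a) * N(t - a)."""
    return sum(weights[a - 1] * cases[t - a]
               for a in range(1, min(t, len(weights)) + 1))


def simulate_cases(r, weights, n0, days):
    """Generate synthetic incidence with a constant reproduction number r."""
    w = _normalize(weights)
    cases = [float(n0)]
    for t in range(1, days):
        cases.append(r * infection_pressure(cases, w, t))
    return cases


def estimate_rt(cases, weights):
    """Estimate R(t) = N(t) / sum_a w(a) N(t - a); NaN where undefined."""
    w = _normalize(weights)
    rt = [math.nan]  # no history is available at t = 0
    for t in range(1, len(cases)):
        pressure = infection_pressure(cases, w, t)
        rt.append(cases[t] / pressure if pressure > 0 else math.nan)
    return rt


if __name__ == "__main__":
    kernel = [0.2, 0.5, 0.3]  # assumed infectiousness by days since infection
    synthetic = simulate_cases(1.5, kernel, 10.0, 15)
    print(estimate_rt(synthetic, kernel)[1:4])
```

On data generated with a constant reproduction number, the estimator recovers that constant. Real incidence data would additionally need smoothing and a kernel estimated from the epidemic in question.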
Dormancy is a widespread adaptive strategy that enables populations to persist in fluctuating environments, yet how its benefits depend on the temporal structure of environmental variability remains unclear. We examine how dormancy interacts with environmental correlation times using a delayed-logistic model in which dormant individuals reactivate after a fixed lag while birth rates fluctuate under temporally correlated stochasticity. Numerical simulations and analytical calculations show that the combination of demographic memory and colored multiplicative noise generates a strongly non-monotonic dependence of fitness on dormancy duration, with three distinct performance regimes. Very short dormancy maximizes linear growth but amplifies fluctuations and extinction risk. Very long dormancy buffers environmental variability, greatly increasing mean extinction times despite slower growth. Strikingly, we find a broad band of intermediate dormancy durations that is maladaptive, simultaneously reducing both growth and persistence due to a mismatch between delay times and environmental autocorrelation. An evolutionary agent-based model confirms bistability between short- and long-dormancy strategies, which avoid intermediate lag times and evolve toward stable extremes. These results show that dormancy duration is not merely a life-history parameter but an adaptive mechanism tuned to environmental timescales, and that intermediate "dangerous middle" strategies can be inherently disfavored. More broadly, this work identifies a generic mechanism by which demographic delays interacting with correlated environmental variability produce a non-monotonic fitness landscape that selects for extreme timing strategies.
Summary: Modern omics experiments now involve multiple conditions and complex designs, producing an increasingly large set of differential expression and functional enrichment analysis results. However, no standardized data structure exists to store and contextualize these results together with their metadata, leaving researchers with an unmanageable and potentially non-reproducible collection of results that are difficult to navigate and/or share. Here we introduce DeeDeeExperiment, a new S4 class for managing and storing omics data analysis results, implemented within the Bioconductor ecosystem, which promotes interoperability, reproducibility and good documentation. This class extends the widely used SingleCellExperiment object by introducing dedicated slots for Differential Expression Analysis (DEA) and Functional Enrichment Analysis (FEA) results, allowing users to organize, store, and retrieve information on multiple contrasts and associated metadata within a single data object, ultimately streamlining the management and interpretation of many omics datasets. Availability and implementation: DeeDeeExperiment is available on Bioconductor under the MIT license (https://bioconductor.org/packages/DeeDeeExperiment), with its development version also available on GitHub (https://github.com/imbeimainz/DeeDeeExperiment).
A few million words suffice for children to acquire language. Yet, the brain mechanisms underlying this unique ability remain poorly understood. To address this issue, we investigate neural activity recorded from over 7,400 electrodes implanted in the brains of 46 children, teenagers, and adults for epilepsy monitoring, as they listened to an audiobook version of "The Little Prince". We then train neural encoding and decoding models using representations, derived either from linguistic theory or from large language models, to map the location, dynamics and development of the language hierarchy in the brain. We find that a broad range of linguistic features is robustly represented across the cortex, even in 2-5-year-olds. Crucially, these representations evolve with age: while fast phonetic features are already present in the superior temporal gyrus of the youngest individuals, slower word-level representations only emerge in the associative cortices of older individuals. Remarkably, this neuro-developmental trajectory is spontaneously captured by large language models: with training, these AI models learned representations that can only be identified in the adult human brain. Together, these findings reveal the maturation of language representations in the developing brain and show that modern AI systems provide a promising tool to model the neural bases of language acquisition.
Sorting cells based on their mechanical properties is essential for applications in disease diagnostics, cell therapy, and biomedical research. Deterministic Lateral Displacement (DLD) devices provide a label-free method for achieving such sorting, but their performance is highly sensitive to cell size and deformability. Designing effective DLD geometries often demands extensive trial-and-error experimentation, as even small variations in cellular mechanical traits can cause significant changes in migration behavior. To address this challenge, we propose a simulation-driven machine learning (ML) framework that predicts suitable DLD design candidates for a given cell type. Our approach integrates high-fidelity particle-based simulations to model cell deformation and migration through microfluidic pillar arrays with supervised ML models trained to estimate optimal geometries. By mapping mechanical parameters such as bending rigidity and shear modulus to deformation index and migration angle, the framework enables rapid, data-informed design of DLD systems. We also demonstrate a deployable web interface to make this tool accessible for real-world device prototyping.
Numerous diseases, particularly autoimmune disorders, are associated with the human leukocyte antigen (HLA), a small genomic region located on human chromosome 6. Adequate characterization of linkage disequilibrium (LD) in the HLA across populations is crucial for identifying genetic markers associated with specific traits and phenotypes. However, current LD measures often fail to capture HLA's structural complexity due to methodological limitations and sensitivity to low-frequency variants, marginal allele frequencies, and haplotype composition. To address these challenges, we introduce the Conditional Informatics Correlation Coefficient (CICC), which integrates conditional probability, information content, and haplotype-aware XOR logic to quantify LD robustly. When applied to high-resolution haploid genomes from the Human Pangenome Reference Consortium (HPRC), CICC revealed 10 novel high-LD regions in HLA. Further analyses using the 1000 Genomes Project and Genome Asia datasets identified nine strongly linked regions shared across five global populations: five in Class I and four in Class II. These results demonstrate CICC's ability to capture complex HLA LD structures across populations, highlighting its broad potential for disease gene mapping, population genomics, and guiding precision medicine.
Art has long played a profound role in shaping human emotion, cognition, and behavior. While visual arts such as painting and architecture have been studied through eye tracking, revealing distinct gaze patterns between experts and novices, analogous methods for auditory art forms remain underdeveloped. Music, despite being a pervasive component of modern life and culture, still lacks objective tools to quantify listeners' attention and perceptual focus during natural listening experiences. To our knowledge, this is the first attempt to decode selective attention to musical elements using naturalistic, studio-produced songs and a lightweight consumer-grade EEG device with only four electrodes. By analyzing neural responses during real-world-like music listening, we test whether decoding is feasible under conditions that minimize participant burden and preserve the authenticity of the musical experience. Our contributions are fourfold: (i) decoding music attention in real studio-produced songs, (ii) demonstrating feasibility with a four-channel consumer EEG, (iii) providing insights for music attention decoding, and (iv) demonstrating improved model ability over prior work. Our findings suggest that musical attention can be decoded not only for novel songs but also across new subjects, showing performance improvements compared to existing approaches under our tested conditions. These findings show that consumer-grade devices can reliably capture signals, and that neural decoding in music could be feasible in real-world settings. This paves the way for applications in education, personalized music technologies, and therapeutic interventions.
EEG recordings are inherently contaminated by artifacts such as ocular, muscular, and environmental noise, which obscure neural activity and complicate preprocessing. Artifact classification offers advantages in stability and transparency, providing a viable alternative to ICA-based methods, one that can be used flexibly alongside human inspection and across various applications. However, artifact classification is limited by its training data as it requires extensive manual labeling, which cannot fully cover the diversity of real-world EEG. Semi-synthetic data (SSD) methods have been proposed to address this limitation, but prior approaches typically injected single artifact types using ICA components or required separately recorded artifact signals, reducing both the realism of the generated data and the applicability of the method. To overcome these issues, we introduce SSDLabeler, a framework that generates realistic, annotated SSDs by decomposing real EEG with ICA, verifying artifacts at the epoch level using root-mean-square (RMS) and power spectral density (PSD) criteria, and reinjecting multiple artifact types into clean data. When applied to train a multi-label artifact classifier, it improved accuracy on raw EEG across diverse conditions compared to prior SSD and raw EEG training, establishing a scalable foundation for artifact handling that captures the co-occurrence and complexity of real EEG.
The ongoing explosion of genome sequence data is transforming how we reconstruct and understand the histories of biological systems. Across biological scales, from individual cells to populations and species, tree-based models provide a common framework for representing ancestry. Once limited to species phylogenetics, "tree thinking" now extends deeply to population genomics and cell biology, revealing the genealogical structure of genetic and phenotypic variation within and across organisms. Recently, there have been great methodological and computational advances in tree-based methods, including methods for inferring ancestral recombination graphs in populations, phylogenetic frameworks for comparative genomics, and lineage-tracing techniques in developmental and cancer biology. Despite differences in data types and biological contexts, these approaches share core statistical and algorithmic challenges: efficiently inferring branching histories from genomic information, integrating temporal and spatial signals, and connecting genealogical structures to evolutionary and functional processes. Recognizing these shared foundations opens opportunities for cross-fertilization between fields that are traditionally studied in isolation. By examining how tree-based methods are applied across cellular, population, and species scales, we identify the conceptual parallels that unite them and the distinct challenges that each domain presents. These comparisons offer new perspectives that can inform algorithmic innovations and lead to more powerful inference strategies across the full spectrum of biological systems.
Energy-based models have become a central paradigm for understanding computation and stability in both theoretical neuroscience and machine learning. However, the energetic framework typically relies on symmetry in synaptic or weight matrices, a constraint that excludes biologically realistic systems such as excitatory-inhibitory (E-I) networks. When symmetry is relaxed, the classical notion of a global energy landscape fails, leaving the dynamics of asymmetric neural systems conceptually unanchored. In this work, we extend the energetic framework to asymmetric firing rate networks, revealing an underlying game-theoretic structure for the neural dynamics in which each neuron is an agent that seeks to minimize its own energy. In addition, we exploit rigorous stability principles from network theory to study regulation and balancing of neural activity in E-I networks. We combine the novel game-energetic interpretation and the stability results to revisit standard frameworks in theoretical neuroscience, such as the Wilson-Cowan and lateral inhibition models. These insights allow us to study cortical columns of lateral inhibition microcircuits as contrast enhancers, with the ability to selectively sharpen subtle differences in the environment through hierarchical excitation-inhibition interplay. Our results bridge energetic and game-theoretic views of neural computation, offering a pathway toward the systematic engineering of biologically grounded, dynamically stable neural architectures.
Accurate prediction of protein function is essential for elucidating molecular mechanisms and advancing biological and therapeutic discovery. Yet experimental annotation lags far behind the rapid growth of protein sequence data. Computational approaches address this gap by associating proteins with Gene Ontology (GO) terms, which encode functional knowledge through hierarchical relations and textual definitions. However, existing models often emphasize one modality over the other, limiting their ability to generalize, particularly to unseen or newly introduced GO terms that frequently arise as the ontology evolves, and making the previously trained models outdated. We present STAR-GO, a Transformer-based framework that jointly models the semantic and structural characteristics of GO terms to enhance zero-shot protein function prediction. STAR-GO integrates textual definitions with ontology graph structure to learn unified GO representations, which are processed in hierarchical order to propagate information from general to specific terms. These representations are then aligned with protein sequence embeddings to capture sequence-function relationships. STAR-GO achieves state-of-the-art performance and superior zero-shot generalization, demonstrating the utility of integrating semantics and structure for robust and adaptable protein function prediction. Code is available at https://github.com/boun-tabi-lifelu/stargo.
The unprecedented extension of the human lifespan necessitates a parallel evolution in how we quantify the quality of aging and its socioeconomic impact. Traditional metrics focusing on Healthspan (years free of disease) overlook the gradual erosion of physiological capacity that occurs even in the absence of illness, leading to declines in productivity and eventual lack of capacity to work. To address this critical gap, we introduce Peakspan: the age interval during which an individual maintains at least 90% of their peak functional performance in a specific physiological or cognitive domain. Our multi-system analysis reveals a profound misalignment: most biological systems reach maximal capacity in early adulthood, resulting in a Peakspan that is remarkably short relative to the total lifespan. This dissociation means humans now spend the majority of their adult lives in a "healthy but declined" state, characterized by a significant functional gap. We argue that extending Peakspan and developing strategies to restore function in post-peak individuals is the functional manifestation of rejuvenative biomedical progress and is essential for sustained economic growth in aging societies. Recognizing and tracking Peakspan, increasingly facilitated by artificial intelligence and foundational models of biological aging, is crucial for developing strategies to compress functional morbidity and maximize human potential across the life course.
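The Peakspan definition above (the age interval over which performance stays at or above 90% of its peak) can be made concrete with a small sketch. It assumes performance is available as a sampled curve over age and returns the contiguous interval around the peak, without interpolating between samples. The function name, the `fraction` parameter, and the example numbers are hypothetical and do not come from the paper.

```python
def peakspan(ages, performance, fraction=0.9):
    """Return (start_age, end_age) of the contiguous interval around the peak
    in which performance is at least `fraction` of its maximum.

    ages:        sample ages in increasing order
    performance: functional performance measured at each age
    """
    if len(ages) != len(performance) or not ages:
        raise ValueError("ages and performance must be non-empty and equal length")
    peak_idx = max(range(len(performance)), key=performance.__getitem__)
    threshold = fraction * performance[peak_idx]

    # Walk outwards from the peak while the curve stays above the threshold.
    lo = peak_idx
    while lo > 0 and performance[lo - 1] >= threshold:
        lo -= 1
    hi = peak_idx
    while hi < len(performance) - 1 and performance[hi + 1] >= threshold:
        hi += 1
    return ages[lo], ages[hi]


if __name__ == "__main__":
    # Made-up curve: peak at 25, falls below 90% of peak after 35.
    sample_ages = [20, 25, 30, 35, 40, 50, 60, 70]
    sample_perf = [85, 100, 97, 92, 88, 80, 70, 55]
    print(peakspan(sample_ages, sample_perf))
```

With coarse sampling the interval endpoints are only as precise as the age grid, so a real analysis would likely fit or interpolate the curve first.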
We analyze a size-structured branching process in which individual cells grow exponentially according to a Feller square-root process and divide under general size-control mechanisms. We obtain exact expressions for the asymptotic population growth rate, the steady-state snapshot distribution of cell sizes, and the fluctuations of the total cell number. Our first result is that the population growth rate is exactly equal to the mean single-cell growth rate, for all noise strengths and for all division and size-regulation schemes that maintain size homeostasis. Thus square-root growth noise is neutral with respect to long-term fitness, in sharp contrast to models with size-independent stochastic growth rates. Second, we show that the steady-state population cell-size distribution is obtained from the deterministic inverse-square-law solution by a one-sided exponential convolution with kernel width set by the strength of growth fluctuations. Third, the mean-rescaled population size $N_t/\left\langle N_t\right\rangle$ converges to a stationary compound Poisson-exponential distribution that depends only on growth noise. This distribution, and hence the long-time shape of population-size fluctuations, is unchanged by division-size noise or asymmetric partitioning. These results identify Feller-type exponential growth with square-root noise as an exactly solvable benchmark for stochastic growth in size-controlled populations and provide concrete signatures that distinguish it from models with size-independent growth-rate noise.
Sparse autoencoders (SAEs) are a mechanistic interpretability technique that has been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate an autoregressive antibody language model, p-IgGen, and steer its generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs are sufficient for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.
This paper presents the Model Gateway, a platform for managing machine learning (ML) and scientific computational models in the drug discovery pipeline. The platform supports Large Language Model (LLM) Agents and Generative AI-based tools to perform ML model management tasks in our Machine Learning operations (MLOps) pipelines, such as registration and management of the dynamic consensus model (a model that aggregates several scientific computational models), retrieval of model information, asynchronous submission and execution of models, and receipt of results once the models complete execution. The platform includes a Model Owner Control Panel, Platform Admin Tools, and a Model Gateway API service for interacting with the platform and tracking model execution. In scaling tests with more than 10k simultaneous application clients consuming models, the platform achieved a 0% failure rate. The Model Gateway is a fundamental part of our model-driven drug discovery pipeline. It has the potential to significantly accelerate the development of new drugs with the maturity of our MLOps infrastructure and the integration of LLM Agents and Generative AI tools.
Healthcare AI systems have historically faced challenges in merging contextual reasoning, long-term state management, and human-verifiable workflows into a cohesive framework. This paper introduces a completely innovative architecture and concept: combining the Model Context Protocol (MCP) with a specific clinical application, known as MCP-AI. This integration allows intelligent agents to reason over extended periods, collaborate securely, and adhere to authentic clinical logic, representing a significant shift away from traditional Clinical Decision Support Systems (CDSS) and prompt-based Large Language Models (LLMs). As healthcare systems become more complex, the need for autonomous, context-aware clinical reasoning frameworks has become urgent. We present MCP-AI, a novel architecture for explainable medical decision-making built upon the Model Context Protocol (MCP), a modular, executable specification for orchestrating generative and descriptive AI agents in real-time workflows. Each MCP file captures clinical objectives, patient context, reasoning state, and task logic, forming a reusable and auditable memory object. Unlike conventional CDSS or stateless prompt-based AI systems, MCP-AI supports adaptive, longitudinal, and collaborative reasoning across care settings. MCP-AI is validated through two use cases: (1) diagnostic modeling of Fragile X Syndrome with comorbid depression, and (2) remote coordination for Type 2 Diabetes and hypertension. In both scenarios, the protocol facilitates physician-in-the-loop validation, streamlines clinical processes, and guarantees secure transitions of AI responsibilities between healthcare providers. The system connects with HL7/FHIR interfaces and adheres to regulatory standards, such as HIPAA and FDA SaMD guidelines. MCP-AI provides a scalable basis for interpretable, composable, and safety-oriented AI within upcoming clinical environments.
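For readers unfamiliar with the TopK variant, the sketch below shows the forward pass of a TopK sparse autoencoder as it is commonly formulated: encode, keep only the k largest latent activations, and decode from that sparse code. It is a minimal pure-Python illustration with toy weights, not the configuration used with p-IgGen. Applying a ReLU to the kept activations is an assumption here, since formulations differ on this point, and Ordered SAEs are not shown.

```python
def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations (clipped at zero); zero out the rest."""
    if not 0 < k <= len(pre_acts):
        raise ValueError("k must be between 1 and the number of latents")
    # Indices of the k largest values; ties are broken by the lower index.
    top = sorted(range(len(pre_acts)), key=lambda i: (-pre_acts[i], i))[:k]
    keep = set(top)
    return [max(v, 0.0) if i in keep else 0.0 for i, v in enumerate(pre_acts)]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec, k):
    """Encode x, sparsify with TopK, and reconstruct.

    w_enc: n_latents x d_model, w_dec: d_model x n_latents (toy values here).
    Returns (sparse_latents, reconstruction).
    """
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    latents = topk_activation(pre, k)
    recon = [r + b for r, b in zip(matvec(w_dec, latents), b_dec)]
    return latents, recon


if __name__ == "__main__":
    w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    w_dec = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
    latents, recon = sae_forward([1.0, 2.0], w_enc, [0.0] * 3, w_dec, [0.0] * 2, k=1)
    print(latents, recon)
```

Steering in this picture amounts to overwriting one entry of `latents` before decoding. Whether that changes generation in the intended way is the causal question the abstract raises.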
Digital assays represent a shift from traditional diagnostics and enable the precise detection of low-abundance analytes, critical for early disease diagnosis and personalized medicine, through discrete counting of biomolecular reporters. Within this paradigm, we present a particle counting algorithm for nanoparticle-based imaging assays, formulated as a multiple-hypothesis statistical test under an explicit image-formation model and evaluated using a penalized likelihood rule. In contrast to thresholding or machine learning methods, this approach requires no training data or empirical parameter tuning, and its outputs remain interpretable through direct links to imaging physics and statistical decision theory. Through numerical simulations we demonstrate robust count accuracy across weak signals, variable backgrounds, magnification changes and moderate PSF mismatch. Particle resolvability tests further reveal characteristic error modes, including under-counting at very small separations and localized over-counting near the resolution limit. We also confirm the algorithm's practical utility through application to experimental dark-field images from a nanoparticle-based assay for detection of DNA biomarkers derived from SARS-CoV-2. Statistically significant differences in particle count distributions are observed between control and positive samples. The full count statistics further exhibit consistent over-dispersion and provide insight into non-specific and target-induced particle aggregation. These results establish our method as a reliable framework for nanoparticle-based detection assays in digital molecular diagnostics.
Approximately 1.4 Ga after life first appeared, atmospheric oxygen suddenly jumped by more than an order of magnitude over a 20-50 Ma period. The contrast between these two timescales does not seem to be due to any sudden, large-amplitude change in external forcing. However, it could be due to processes intrinsic to the geobiological system itself, namely, positive feedback between atmospheric oxygen and photosynthetic bacteria: more oxygen leads to more photosynthesis, which leads to more oxygen, and so on. Already-published feedbacks include buildup of an ozone shield and nutrient production by oxidative weathering. The feedback proposed here is the 15-fold greater efficiency of aerobic vs anaerobic respiration and the tight coupling of respiration and photosynthesis inside the cell. As in the climate system, feedback leads to tipping points, where a rapid, large-amplitude change in the state of the system occurs. For the geobiological system, the Great Oxidation Event (GOE) is the tipping point, and the long buildup before the GOE is the gradual oxidation of the crust and ocean, due to burial of organic matter, oxidation of volcanic gases, or escape of hydrogen to space. The feedback hypothesis is a framework for interpreting observations leading to the GOE.
The past decade has seen a surge in the amount of electronic health record (EHR) data in the United States, attributed to a favorable policy environment created by the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009 and the 21st Century Cures Act of 2016. Clinical notes for patients' assessments, diagnoses, and treatments are captured in these EHRs in free-form text by physicians, who spend a considerable amount of time entering and editing them. This manual effort increases patients' waiting time and may delay diagnoses. Large language models (LLMs) possess the ability to generate news articles that closely resemble human-written ones. We investigate the usage of Chain-of-Thought (CoT) prompt engineering to improve the LLM's response in clinical note generation. In our prompts, we use International Classification of Diseases (ICD) codes and basic patient information as input. We investigate a strategy that combines the traditional CoT with semantic search results to improve the quality of generated clinical notes. Additionally, we infuse a knowledge graph (KG) built from clinical ontology to further enrich the domain-specific knowledge of generated clinical notes. We test our prompting technique on six clinical cases from the CodiEsp test dataset using GPT-4, and our results show that it outperforms standard one-shot prompts.
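Over-dispersion of count data, as mentioned above, is often summarized by the variance-to-mean ratio (index of dispersion), which is about 1 for Poisson-distributed counts and larger when particles aggregate. The abstract does not say how over-dispersion was quantified, so the sketch below is only one common choice, and the per-image counts in it are invented for illustration.

```python
import statistics


def dispersion_index(counts):
    """Sample variance divided by the mean of a list of per-image counts.

    Values near 1 are consistent with Poisson counting statistics;
    values well above 1 indicate over-dispersion.
    """
    if len(counts) < 2:
        raise ValueError("need at least two counts")
    mean = statistics.mean(counts)
    if mean == 0:
        raise ValueError("mean count is zero; the index is undefined")
    return statistics.variance(counts) / mean


if __name__ == "__main__":
    # Hypothetical particle counts from six fields of view.
    counts = [2, 4, 4, 10, 0, 16]
    print(round(dispersion_index(counts), 3))
```

A ratio alone does not separate non-specific from target-induced aggregation. That would require comparing the full count distributions of control and positive samples.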
Given a sequence $s_1$ of $n$ letters drawn i.i.d. from an alphabet of size $\sigma$ and a mutated substring $s_2$ of length $m < n$, we often want to recover the mutation history that generated $s_2$ from $s_1$. Modern sequence aligners are widely used for this task, and many employ the seed-chain-extend heuristic with $k$-mer seeds. Previously, Shaw and Yu showed that optimal linear-gap cost chaining can produce a chain with $1 - O\left(\frac{1}{\sqrt{m}}\right)$ recoverability, the proportion of the mutation history that is recovered, in $O\left(mn^{2.43\theta} \log n\right)$ expected time, where $\theta < 0.206$ is the mutation rate under a substitution-only channel and $s_1$ is assumed to be uniformly random. However, a gap remains between theory and practice, since real genomic data includes insertions and deletions (indels), and yet seed-chain-extend remains effective. In this paper, we generalize those prior results by introducing mathematical machinery to deal with the two new obstacles introduced by indel channels: the dependence of neighboring anchors and the presence of anchors that are only partially correct. We are thus able to prove that the expected recoverability of an optimal chain is $\ge 1 - O\left(\frac{1}{\sqrt{m}}\right)$ and the expected runtime is $O\left(mn^{3.15 \cdot \theta_T}\log n\right)$, when the total mutation rate given by the sum of the substitution, insertion, and deletion mutation rates ($\theta_T = \theta_i + \theta_d + \theta_s$) is less than $0.159$.
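The seeding and chaining stages of seed-chain-extend can be sketched in a few lines. The code below finds exact $k$-mer matches (anchors) between the two sequences and then runs a quadratic-time dynamic program that selects the highest-scoring chain of anchors increasing in both coordinates. The gap cost used here, a penalty proportional to the total distance between consecutive anchors, is one simple linear form chosen for illustration. The exact objective, the parameter values, and the faster chaining algorithms analyzed in the work above may differ, and the extension step is omitted.

```python
from collections import defaultdict


def kmer_anchors(s1, s2, k):
    """Return all (i, j) with s1[i:i+k] == s2[j:j+k], sorted by (i, j)."""
    index = defaultdict(list)
    for i in range(len(s1) - k + 1):
        index[s1[i:i + k]].append(i)
    anchors = []
    for j in range(len(s2) - k + 1):
        for i in index.get(s2[j:j + k], []):
            anchors.append((i, j))
    anchors.sort()
    return anchors


def chain_anchors(anchors, gap_penalty=0.1, anchor_score=1.0):
    """Best chain under an assumed linear gap cost, via O(N^2) dynamic programming.

    `anchors` must be sorted by (i, j), as returned by kmer_anchors.
    A chain is a sequence of anchors strictly increasing in both i and j.
    Each anchor adds `anchor_score`; each step between consecutive anchors
    costs gap_penalty * ((i2 - i1) + (j2 - j1)).
    Returns (score, chain).
    """
    n = len(anchors)
    if n == 0:
        return 0.0, []
    best = [anchor_score] * n  # best chain score ending at each anchor
    prev = [-1] * n            # back-pointers for reconstruction
    for b in range(n):
        i2, j2 = anchors[b]
        for a in range(b):     # sorted input: predecessors come earlier
            i1, j1 = anchors[a]
            if i1 < i2 and j1 < j2:
                cand = best[a] + anchor_score - gap_penalty * ((i2 - i1) + (j2 - j1))
                if cand > best[b]:
                    best[b], prev[b] = cand, a
    end = max(range(n), key=best.__getitem__)
    score = best[end]
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = prev[end]
    chain.reverse()
    return score, chain


if __name__ == "__main__":
    reference = "ACGTACGGTCA"
    query = "ACGGTC"  # exact substring of the reference starting at position 4
    found = kmer_anchors(reference, query, k=3)
    print(found)
    print(chain_anchors(found))
```

In this toy case the spurious anchor at reference position 0 is left out of the best chain because reaching the next anchor from it incurs a larger gap cost than starting at position 4.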