Viral Proteins Reveal Geometry of Protein Language Models

Authors: Arthur Bigot, Harmon Bhasin, Core Francisco Park, Eugene Shakhnovich, Dianzhuo Wang

Protein language models are trained on highly imbalanced datasets, raising the question of how they represent underrepresented biological sequences. Using viral proteins as a case study across ESM model families, we identify a dominant nativeness axis in embedding space, aligned with masked reconstruction perplexity, that orders sequences from well-modeled cellular proteins through viral proteins to shuffled and random sequences. Scaling contracts this axis unevenly across viral families. Despite this, protein language model embeddings retain viral-specific signal: viral proteins remain linearly separable beyond zero-shot perplexity and shallow sequence features. Together, these results suggest that pLM representations are structured by a general notion of nativeness while preserving information specific to distinct biological groups.
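As a rough illustration of the masked reconstruction perplexity mentioned in the abstract, the sketch below uses a common pseudo-perplexity definition: mask each position in turn, record the log-probability the model assigns to the true residue, and exponentiate the mean negative log-probability. The abstract does not state the exact formula the authors use, so this definition, the function name `pseudo_perplexity`, and the input format (a list of per-position natural-log probabilities already obtained from some model) are assumptions made for illustration.

```python
import math


def pseudo_perplexity(masked_log_probs):
    """Pseudo-perplexity from per-position masked log-probabilities.

    masked_log_probs[i] is assumed to be the natural-log probability a
    model assigns to the true residue at position i when that position
    is masked. This input format is hypothetical, not taken from the paper.
    """
    if not masked_log_probs:
        raise ValueError("need at least one position")
    # Mean negative log-likelihood over all masked positions.
    mean_nll = -sum(masked_log_probs) / len(masked_log_probs)
    return math.exp(mean_nll)


# Sanity check: a model that is uniform over the 20 standard amino acids
# should have a pseudo-perplexity close to 20, whatever the sequence length.
uniform_log_probs = [math.log(1 / 20)] * 50
print(pseudo_perplexity(uniform_log_probs))
```

Under this definition, lower values correspond to sequences the model reconstructs well, which matches the ordering the abstract describes: cellular proteins at one end of the axis, shuffled and random sequences at the other.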

Subjects: Machine Learning, Quantitative Methods

Published: 2026-06-10 19:04:34 UTC

2606.12609
