LM2Protein: A Structure-to-Token Protein Large Language Model

Authors: Chang Zhou, Yuheng Shan, Pengan Chen, Xiangyu Shi, Zikang Wang, Yanting Li, Jiyue Jiang

Proteins carry out a wide range of molecular functions that depend on their precise tertiary structures. The structure-sequence relationship is complex and degenerate: multiple sequences can fold into a similar structure. The challenges in protein prediction, design, and modification increase with sequence complexity, while research on RNA-protein interactions, especially those involving RNA-binding proteins (RBPs), is gaining importance. Large-scale pre-trained language models (LLMs) have shown promising results in handling biological sequences by treating them as natural language, though integrating spatial structures remains difficult because it typically requires specialized visual and 3D modeling approaches. We introduce a method to integrate protein 3D structural data within a sequence processing framework, converting 3D coordinates into discrete structure tokens using a VQ-VAE-like network. This simplifies the handling of 3D data, avoiding complex pipelines and enabling a unified sequence-to-sequence processing model. Our approach demonstrates strong performance across a range of tasks, achieving high sequence recovery in inverse folding and protein-conditioned RNA design. These results suggest significant potential for application in complex biological systems research.
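
The abstract does not spell out how the structure tokens are computed, so the following is only a minimal sketch of the generic vector-quantization step used in VQ-VAE-style models, not the authors' implementation. It assumes each residue has already been mapped to a latent vector (in a real model, by a learned encoder over 3D coordinates) and assigns each latent the index of its nearest codebook entry. The toy 2D codebook, the latent values, the `<s{i}>` token format, and all function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def to_structure_tokens(indices: List[int]) -> List[str]:
    """Render codebook indices as discrete tokens (hypothetical format)."""
    return [f"<s{i}>" for i in indices]


# Hypothetical toy codebook with four entries in a 2D latent space.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical per-residue latents; a real model would produce these
# from backbone coordinates with a trained encoder.
latents = [(0.1, -0.2), (0.9, 0.8), (0.2, 1.1)]

indices = quantize(latents, codebook)
tokens = to_structure_tokens(indices)
print(tokens)  # ['<s0>', '<s3>', '<s2>']
```

Once a structure is expressed as such a token sequence, it can be placed alongside amino-acid or nucleotide tokens in an ordinary sequence-to-sequence input, which is the simplification the abstract describes.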

Subject: EMNLP.2025 - Findings

2025.findings-emnlp.369@ACL
