
Retrieval-Augmented Language Model for Knowledge-aware Protein Encoding

Authors: Jiasheng Zhang, Delvin Zhang, Shuang Liang, Zhengpin Li, Zhitao Ying, Jie Shao

Protein language models often struggle to capture biological functions due to their lack of factual knowledge (e.g., gene descriptions). Existing solutions leverage protein knowledge graphs (PKGs) as auxiliary pre-training objectives, but lack explicit integration of task-oriented knowledge, leaving them prone to limited knowledge exploitation and catastrophic forgetting. The root cause is that they fail to align PKGs with task-specific data, forcing their knowledge modeling to conform to the knowledge-isolated nature of downstream tasks. In this paper, we propose the Knowledge-aware retrieval-augmented protein language model (Kara), achieving the first task-oriented and explicit integration of PKGs and protein language models. Using a knowledge retriever that learns to predict linkages between the PKG and task proteins, Kara unifies knowledge integration across the pre-training and fine-tuning stages through a structure-based regularization, mitigating catastrophic forgetting. To ensure task-oriented integration, Kara uses contextualized virtual tokens to extract graph context as task-specific knowledge for new proteins. Experiments show that Kara outperforms existing knowledge-enhanced models on 6 representative tasks, with an average improvement of 5.1%.
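
The abstract describes two mechanisms: a retriever that predicts linkages between PKG entities and a query protein, and retrieved knowledge injected as virtual tokens prepended to the protein sequence. Below is a minimal PyTorch sketch of that general idea, not the authors' implementation; the class names, the projected dot-product link scorer, mean-pooled queries, and the top-k cutoff are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not Kara's actual code): retrieve top-k PKG
# entities for a protein and prepend them as virtual tokens before encoding.
import torch
import torch.nn as nn


class KnowledgeRetriever(nn.Module):
    """Scores all PKG entity embeddings against a pooled protein query."""

    def __init__(self, dim, num_entities):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, dim)  # PKG node embeddings
        self.query_proj = nn.Linear(dim, dim)              # protein repr -> query space

    def forward(self, protein_repr, k=4):
        query = self.query_proj(protein_repr)              # (batch, dim)
        scores = query @ self.entity_emb.weight.t()        # (batch, num_entities) linkage scores
        topk = scores.topk(k, dim=-1).indices              # indices of retrieved PKG nodes
        return self.entity_emb(topk)                       # (batch, k, dim) virtual tokens


class KnowledgeAwareEncoder(nn.Module):
    """Prepends retrieved-knowledge virtual tokens to protein token embeddings."""

    def __init__(self, dim=64, num_entities=1000, vocab=26):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)            # amino-acid token embeddings
        self.retriever = KnowledgeRetriever(dim, num_entities)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):
        x = self.tok_emb(tokens)                           # (batch, len, dim)
        query = x.mean(dim=1)                              # crude pooled query (assumption)
        virtual = self.retriever(query)                    # (batch, k, dim) graph context
        return self.encoder(torch.cat([virtual, x], dim=1))


enc = KnowledgeAwareEncoder()
out = enc(torch.randint(0, 26, (2, 50)))                   # two toy sequences
print(out.shape)                                           # (2, 54, 64): 4 virtual tokens + 50 residues
```

In this sketch the virtual tokens attend to and are attended by the residue tokens inside the transformer, which is one plausible way to realize the paper's "contextualized virtual tokens"; the structure-based regularization tying pre-training and fine-tuning together is not shown.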

Subject: ICML.2025 - Poster