Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training

Authors: Leiyu Pan, Bojian Xiong, Lei Yang, Renren Jin, Shaowei Zhang, Yue Chen, Ling Shi, Jiang Zhou, Junru Wu, Zhen Wang, Jianxiang Peng, Juesi Xiao, Tianyu Dong, Zhuowen Han, Zhuo Chen, Yuqi Ren, Deyi Xiong

Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan. With the curated data, we apply continual pre-training and post-training to a multilingual base model to enhance its generative capabilities in Tibetan. To evaluate the Tibetan capabilities of the model, we create new high-quality Tibetan benchmarks and complement them with existing public benchmarks. Experimental results demonstrate that our model consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.

Subject: Computation and Language

Published: 2025-07-12 08:54:05 UTC

2507.09205
