2025.emnlp-tutorials.5@ACL

Data and Model Centric Approaches for Expansion of Large Language Models to New Languages

Authors: Anoop Kunchukuttan, Raj Dabre, Rudra Murthy, Mohammed Safi Ur Rahman Khan, Thanmay Jayakumar

Despite the increasing pace of Large Language Model (LLM) research, the vast majority of existing LLMs mainly support English alongside a handful of high-resource languages, leaving a major gap for most low-resource languages. In this tutorial, we focus on approaches to expand the language coverage of LLMs, which offer an efficient and viable path to bring LLM technologies to low-resource languages without training from scratch. We examine adaptations made to support new languages at various stages of the LLM training pipeline, such as tokenizer training, pre-training, instruction tuning, alignment, and evaluation, covering both data-oriented and model-oriented approaches. We hope that our tutorial enables researchers and practitioners to incorporate additional languages and tasks into existing LLMs, enhancing inclusivity and coverage.

Subject: EMNLP.2025 - Tutorial Abstracts