2506.17789

Total: 1

#1 Multilingual Tokenization through the Lens of Indian Languages: Challenges and Insights [PDF] [Copy] [Kimi] [REL]

Authors: N J Karthika, Maharaj Brahma, Rohit Saluja, Ganesh Ramakrishnan, Maunendra Sankar Desarkar

Tokenization plays a pivotal role in multilingual NLP. However, existing tokenizers are often skewed towards high-resource languages, limiting their effectiveness for linguistically diverse and morphologically rich languages such as those in the Indian subcontinent. This paper presents a comprehensive intrinsic evaluation of tokenization strategies across 17 Indian languages. We quantify the trade-offs between bottom-up and top-down tokenizer algorithms (BPE and Unigram LM), effects of vocabulary sizes, and compare strategies of multilingual vocabulary construction such as joint and cluster-based training. We also show that extremely low-resource languages can benefit from tokenizers trained on related high-resource languages. Our study provides practical insights for building more fair, efficient, and linguistically informed tokenizers for multilingual NLP.

Subject: Computation and Language

Publish: 2025-06-21 18:47:33 UTC