Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models

Authors: Gyeongje Cho, Yeonkyoun So, Chanwoo Park, Sangmin Lee, Sungmok Jung, Jaejin Lee

This paper introduces Thunder-Tok, a new Korean tokenizer designed to reduce token fertility without compromising model performance. Our approach uses a rule-based pre-tokenization method that aligns with the linguistic structure of the Korean language. We also create a seed vocabulary containing tokens that resemble linguistic units and employ a branching entropy-based selection algorithm. These techniques increase the average token length, thus lowering fertility while preserving linguistic information. Experimental results indicate that Thunder-Tok reduces fertility by approximately 10% (i.e., reduces the number of tokens by 10%, improving the inference speed by 10%) compared to BPE without compromising performance across various downstream tasks. These findings demonstrate that our linguistically informed approach is effective and practical for designing efficient tokenizers for language models.
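As a rough illustration of the fertility metric discussed in the abstract, the sketch below computes the average number of tokens per whitespace-delimited word and the relative reduction between two tokenizers. It is not the Thunder-Tok implementation: `fertility`, `char_tokenize`, `chunk_tokenize`, and the two-sentence corpus are hypothetical stand-ins chosen only to make the arithmetic visible, and splitting words on whitespace is an assumption about how words are counted.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word (assumed definition)."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    if n_words == 0:
        raise ValueError("corpus contains no words")
    return n_tokens / n_words


def char_tokenize(text: str) -> List[str]:
    """Toy baseline: one token per non-space character (one Hangul syllable each)."""
    return [ch for ch in text if not ch.isspace()]


def chunk_tokenize(text: str, size: int = 2) -> List[str]:
    """Toy tokenizer with longer tokens: each word is cut into chunks of `size` characters."""
    return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]


# Hypothetical toy corpus; a real measurement would use a held-out Korean corpus.
corpus = ["나는 학교에 갔다", "오늘은 날씨가 좋다"]

f_char = fertility(corpus, char_tokenize)
f_chunk = fertility(corpus, chunk_tokenize)

# Relative fertility reduction of the longer-token tokenizer versus the baseline.
reduction = 1 - f_chunk / f_char

print(f"baseline fertility:  {f_char:.2f}")
print(f"chunked fertility:   {f_chunk:.2f}")
print(f"relative reduction:  {reduction:.0%}")
```

Under these toy assumptions, longer tokens mean fewer tokens per word, which is the effect the abstract attributes to increasing the average token length. The same calculation could be applied to any pair of real tokenizers by passing their encode functions as `tokenize`.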

Subjects: Computation and Language, Artificial Intelligence

Publish: 2025-06-18 04:40:44 UTC

arXiv: 2506.15138
