MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment

The Unigram tokenizer uses an elegant representation which makes it straightforward to edit vocabularies, but its training is comparatively heavy and complex. We introduce MinGram (Minimalist Unigram), which keeps the token-list representation but simplifies training using a BPE-derived seed vocabulary, Hard EM on a minimum-token path, and a single flat score-pruning step. This removes the suffix array, the forward-backward pass, and the iterative prune loop, leaving a procedure that requires little beyond tokenizer inference itself. By making token count the primary objective and using a Unigram score only as a tiebreak, MinGram keeps the compression of pure token-count methods while retaining much of the morphological alignment and downstream quality of probabilistic ones. Across six languages, MinGram compresses better than both BPE and standard Unigram, and a compression-oriented variant matches the strongest token-count compressors while retaining substantially higher morphological alignment. In controlled downstream language-model training, Unigram-family tokenizers, with MinGram among the best, consistently beat BPE in bits-per-byte.
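The idea of treating token count as the primary objective, with the Unigram score as a tiebreak, can be illustrated with a small dynamic program. The following is a minimal sketch and not the authors' implementation: the function names `segment_min_tokens` and `hard_em_step`, the `max_len` cap, the smoothing constant `alpha`, and the toy vocabulary are all assumptions made for illustration. Costs are compared lexicographically as `(token_count, -total_log_score)`, so a higher score only matters between segmentations of equal length.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def segment_min_tokens(text: str, scores: Dict[str, float], max_len: int = 16) -> List[str]:
    """Segment `text` into the fewest vocabulary tokens.

    Ties in token count are broken by the higher summed log-score.
    `scores` maps token -> log-probability-like score (hypothetical format).
    """
    n = len(text)
    unreachable: Tuple[float, float] = (math.inf, math.inf)
    # best[i] = (token_count, negated_total_score) for the best split of text[:i]
    best: List[Tuple[float, float]] = [unreachable] * (n + 1)
    back: List[int] = [-1] * (n + 1)
    best[0] = (0, 0.0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece not in scores or best[start] == unreachable:
                continue
            count, neg_score = best[start]
            candidate = (count + 1, neg_score - scores[piece])
            # Tuple comparison is lexicographic: count first, then score.
            if candidate < best[end]:
                best[end] = candidate
                back[end] = start
    if n > 0 and back[n] == -1:
        raise ValueError("text cannot be segmented with this vocabulary")
    tokens: List[str] = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


def hard_em_step(corpus: List[str], scores: Dict[str, float], alpha: float = 1e-3) -> Dict[str, float]:
    """One Hard EM update: count tokens on the chosen paths, then renormalize.

    The additive smoothing `alpha` is an assumption that keeps unused tokens finite.
    """
    counts: Counter = Counter()
    for word in corpus:
        counts.update(segment_min_tokens(word, scores))
    total = sum(counts.values()) + alpha * len(scores)
    return {tok: math.log((counts[tok] + alpha) / total) for tok in scores}


# Toy vocabulary (hypothetical scores).
toy_scores = {"a": -0.5, "b": -0.5, "c": -0.5, "d": -0.5,
              "ab": -1.0, "cd": -1.0, "bcd": -3.0, "abc": -5.0}
print(segment_min_tokens("abcd", toy_scores))
updated = hard_em_step(["abcd", "ab"], toy_scores)
```

In this toy example three two-token segmentations of "abcd" exist ("ab|cd", "a|bcd", "abc|d"); the score tiebreak selects among them, while the four-token split into single characters is never considered competitive regardless of its score. A final pruning step would then drop low-scoring tokens in a single pass, which this sketch does not attempt to reproduce.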

Subject: Computation and Language

Publish: 2026-06-25 13:31:02 UTC

2606.27019
