Scaling Spoken Language Models with Syllabic Speech Tokenization


Authors: Nicholas Lee, Cheol Jun Cho, Alan W Black, Gopala K. Anumanchipalli

Spoken language models (SLMs) typically discretize speech into high-frame-rate tokens extracted from SSL speech models. Because the most successful LMs are based on the Transformer architecture, processing these long token streams is expensive: self-attention scales quadratically with sequence length. A recent SSL work introduces acoustic tokenization of speech at the syllable level, which is more interpretable and potentially more scalable, with significant compression in token lengths (4-5 Hz). However, the value of these syllabic tokens for spoken language modeling has not been fully explored. We present the first systematic study of syllabic tokenization for spoken language modeling, evaluating models on a suite of SLU benchmarks while varying training data scale. Syllabic tokens can match or surpass the previous high-frame-rate tokens while significantly cutting training and inference costs, achieving more than a 2x reduction in training time and a 5x reduction in FLOPs. Our findings highlight syllable-level language modeling as a promising path to efficient long-context spoken language models.
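The quadratic-attention argument in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the paper: the 25 Hz baseline frame rate and the 60-second clip are assumptions chosen for illustration, and the helper names `token_count` and `attention_cost_ratio` are hypothetical. It compares only the pairwise attention-score term, so it does not reproduce the reported 2x training-time or 5x FLOPs figures, which cover the whole model.

```python
def token_count(duration_s: float, rate_hz: float) -> int:
    """Number of tokens produced for a clip at a given token rate."""
    return round(duration_s * rate_hz)


def attention_cost_ratio(n_long: int, n_short: int) -> float:
    """Ratio of pairwise attention-score costs, which grow as n**2."""
    return (n_long / n_short) ** 2


duration_s = 60.0    # assumed clip length, for illustration only
baseline_hz = 25.0   # ASSUMPTION: a high-frame-rate tokenizer, not stated in the abstract
syllabic_hz = 5.0    # upper end of the 4-5 Hz range quoted in the abstract

n_base = token_count(duration_s, baseline_hz)
n_syl = token_count(duration_s, syllabic_hz)
ratio = attention_cost_ratio(n_base, n_syl)

print(f"baseline tokens: {n_base}, syllabic tokens: {n_syl}")
print(f"attention-score cost ratio: {ratio:.1f}x")
```

Under these assumptions the sequence is 5x shorter and the attention-score term alone shrinks by 25x. End-to-end savings are smaller because feed-forward and projection layers scale linearly with sequence length.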

Subjects: Computation and Language, Audio and Speech Processing

Publish: 2025-09-30 17:59:09 UTC

2509.26634

