wan25@interspeech_2025@ISCA

#1 SpecTokenizer: A Lightweight Streaming Codec in the Compressed Spectrum Domain

Authors: Zixiang Wan, Guochang Zhang, Yifeng He, Jianqiang Wei

Neural audio codecs (NACs) have gained growing attention in recent years, both for audio compression and as audio representations in speech language models. While mainstream NACs typically require billions of operations (G-level computation) and millions of parameters (M-level), the performance of lightweight, streaming NACs remains underexplored. This paper proposes SpecTokenizer, a lightweight streaming codec that operates in the compressed spectrum domain. Composed solely of alternating CNN and RNN layers, SpecTokenizer achieves greater efficiency and better representational capability through multi-scale modeling in that domain. At 4 kbps, SpecTokenizer performs comparably to or better than the codec with the state-of-the-art lightweight architecture while requiring only 20% of the computation and 10% of the parameters. Furthermore, it significantly outperforms that codec when the two use similar computational and storage resources.

Subject: INTERSPEECH.2025 - Speech Processing