FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates

Authors: Jiaqi Li, Yao Qian, Yuxuan Hu, Leying Zhang, Xiaofei Wang, Heng Lu, Manthan Thakker, Jinyu Li, Sheng Zhao, Zhizheng Wu

Neural audio codecs are foundational to speech language models. They are expected to have a low frame rate and to decouple semantic and acoustic information. A lower-frame-rate codec reduces the computational cost of speech language models by shortening the sequence length. Recent studies have developed 12.5Hz low-frame-rate audio codecs, but codecs with even lower frame rates remain underexplored. We find that a major challenge for very-low-frame-rate tokens is missing semantic information. This paper introduces FlexiCodec to address this limitation. FlexiCodec improves semantic preservation with a dynamic frame rate approach and introduces a novel architecture featuring ASR feature-assisted dual-stream encoding and Transformer bottlenecks. With dynamic frame rates, it uses fewer frames in information-sparse regions by adaptively merging semantically similar frames. A dynamic frame rate also allows FlexiCodec to support inference-time controllable frame rates between 3Hz and 12.5Hz. Experiments at average frame rates of 6.25Hz, 8.3Hz and 12.5Hz confirm that FlexiCodec outperforms baseline systems in semantic information preservation and delivers high audio reconstruction quality. We also validate the effectiveness of FlexiCodec in language model-based TTS. Demos are available at: https://flexicodec.github.io
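To make the idea of "adaptively merging semantically similar frames" concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each frame is a plain feature vector, that similarity is measured by cosine similarity between adjacent frames, and that a merged group is represented by its mean; the names `merge_similar_frames`, `threshold`, `max_run`, and the 12.5Hz base rate used in the example are all hypothetical choices for illustration. The abstract does not specify the actual merging criterion.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_similar_frames(frames, threshold=0.9, max_run=4):
    """Greedily merge runs of adjacent frames that look alike.

    Hypothetical rule (an assumption, not taken from the paper): a frame joins
    the current group if its cosine similarity to the group's last frame is at
    least `threshold` and the group holds fewer than `max_run` frames.
    Returns (merged_frames, lengths), where lengths[i] is the number of
    original frames averaged into merged_frames[i].
    """
    if not frames:
        return [], []
    groups = [[frames[0]]]
    for frame in frames[1:]:
        current = groups[-1]
        if len(current) < max_run and cosine(current[-1], frame) >= threshold:
            current.append(frame)
        else:
            groups.append([frame])
    # Represent each group by the element-wise mean of its frames.
    merged = [[sum(col) / len(group) for col in zip(*group)] for group in groups]
    lengths = [len(group) for group in groups]
    return merged, lengths


def average_frame_rate(base_rate_hz, lengths):
    """Average frame rate after merging, given the pre-merge rate."""
    return base_rate_hz * len(lengths) / sum(lengths)


# Toy input: six 2-dimensional frames at an assumed 12.5Hz base rate.
frames = [[1, 0], [1, 0.01], [0, 1], [0, 1], [0, 1], [1, 1]]
merged, lengths = merge_similar_frames(frames, threshold=0.9)
print(lengths)                            # [2, 3, 1]
print(average_frame_rate(12.5, lengths))  # 6.25
```

In this sketch, raising `threshold` merges fewer frames and pushes the average rate back toward the base rate, while lowering it merges more aggressively. That is one simple way an inference-time knob could trade sequence length against detail, though the mechanism FlexiCodec actually uses may differ.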

Subject: Sound

Publish: 2025-10-01 14:56:18 UTC

2510.00981
