Speech-XL: Towards Long-Form Speech Understanding in Large Speech Language Models

Authors: Haoqin Sun, Chenyang Lyu, Shiwan Zhao, Xuanfan Ni, Xiangyu Kong, Longyue Wang, Weihua Luo, Yong Qin

Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding is severely bottlenecked. This limitation stems from restricted context length and the exorbitant memory footprint required for long-form inference. In this work, we propose Speech-XL, a new model that capitalizes on the intrinsic key-value (KV) sparsification capacity of Large Language Models (LLMs) to achieve high-ratio speech input compression. Specifically, we introduce a novel special token, the Speech Summarization Token (SST), for each speech interval to encapsulate the intra-interval speech information into its associated KV pairs. The SST module is trained via instruction fine-tuning, employing a curriculum learning strategy in which the SST learns to compress information progressively, advancing from low-ratio (simple) to high-ratio (challenging) compression. Despite utilizing significantly less training data than other baselines, our model achieves highly competitive performance on major benchmarks, including LongSpeech and AUDIOMARATHON. By addressing the long-standing bottlenecks in long-form audio modeling, our approach offers a novel perspective on the condensation of extensive acoustic sequences.
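The bookkeeping behind the per-interval summary token can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the token string, the interval sizes, or the shape of the curriculum, so the `<SST>` placeholder, the function names, and the doubling schedule below are all assumptions made for illustration. The sketch only shows where summary tokens would sit in a sequence and which positions would remain if only their KV pairs were kept. It does not model attention or training.

```python
from typing import List

SST = "<SST>"  # hypothetical placeholder string for the summary token


def insert_summary_tokens(frames: List[str], interval: int) -> List[str]:
    """Append one summary token after every `interval` frames.

    The final interval may be shorter than `interval`.
    """
    if interval < 1:
        raise ValueError("interval must be >= 1")
    out: List[str] = []
    for start in range(0, len(frames), interval):
        out.extend(frames[start:start + interval])
        out.append(SST)
    return out


def retained_positions(sequence: List[str]) -> List[int]:
    """Indices that survive if only summary-token KV pairs are kept."""
    return [i for i, tok in enumerate(sequence) if tok == SST]


def curriculum_intervals(low: int, high: int) -> List[int]:
    """Assumed schedule: double the interval from `low` up to `high`."""
    if low < 1 or low > high:
        raise ValueError("require 1 <= low <= high")
    schedule: List[int] = []
    ratio = low
    while ratio < high:
        schedule.append(ratio)
        ratio *= 2
    schedule.append(high)
    return schedule


if __name__ == "__main__":
    frames = [f"f{i}" for i in range(10)]
    seq = insert_summary_tokens(frames, interval=4)
    kept = retained_positions(seq)
    print(seq)
    print(kept, f"{len(frames)} frames -> {len(kept)} retained positions")
    print(curriculum_intervals(2, 16))
```

In this toy setting, a larger interval means fewer retained positions per frame, which corresponds to the "high-ratio" end of the curriculum described in the abstract.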

Subject: Sound

Publish: 2026-02-05 06:50:49 UTC

2602.05373
