Text-only adaptation in LLM-based ASR through text denoising

Authors: Sergio Burdisso, Esaú Villatoro-Tello, Andrés Carofilis, Shashi Kumar, Kadri Hacioglu, Srikanth Madikeri, Pradeep Rangappa, Manjunath K E, Petr Motlicek, Shankar Venkatesan, Andreas Stolcke

Adapting automatic speech recognition (ASR) systems based on large language models (LLMs) to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on target-domain text often disrupts the critical alignment between speech and text modalities learned by the projector, degrading performance. We introduce a novel text-only adaptation method that emulates the audio projection task by treating it as a text denoising task. Our approach thus trains the LLM to recover clean transcripts from noisy inputs. This process effectively adapts the model to a target domain while preserving cross-modal alignment. Our solution is lightweight, requiring no architectural changes or additional parameters. Extensive evaluation on two datasets demonstrates up to 22.1% relative improvement, outperforming recent state-of-the-art text-only adaptation methods.
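As a rough illustration of the denoising setup described above, the following sketch builds (noisy input, clean target) pairs from target-domain transcripts. The abstract does not say how the noise is generated, so the character-level deletion, substitution, and insertion scheme here is an assumption, and the names `add_noise`, `make_denoising_pairs`, and the `rate` parameter are hypothetical, not taken from the paper.

```python
import random

# Hypothetical noise alphabet; the paper's actual noise model is not given in the abstract.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def add_noise(text, rate=0.15, seed=None):
    """Corrupt a transcript with random character deletions, substitutions, and insertions.

    Each character is altered with probability `rate`, split evenly across the
    three operations. This is an assumed stand-in for projector-like noise.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        r = rng.random()
        if r < rate / 3:
            continue  # deletion: drop the character
        elif r < 2 * rate / 3:
            out.append(rng.choice(ALPHABET))  # substitution: replace the character
        elif r < rate:
            out.append(ch)
            out.append(rng.choice(ALPHABET))  # insertion: keep it, then add a random one
        else:
            out.append(ch)  # unchanged
    return "".join(out)


def make_denoising_pairs(transcripts, rate=0.15, seed=0):
    """Build training examples: the model sees `input` and is trained to emit `target`."""
    rng = random.Random(seed)
    pairs = []
    for clean in transcripts:
        noisy = add_noise(clean, rate=rate, seed=rng.randrange(2**32))
        pairs.append({"input": noisy, "target": clean})
    return pairs


if __name__ == "__main__":
    domain_text = ["the patient was prescribed ten milligrams", "schedule a follow up visit"]
    for pair in make_denoising_pairs(domain_text):
        print(repr(pair["input"]), "->", repr(pair["target"]))
```

Pairs like these could then be fed to an ordinary sequence-to-sequence fine-tuning loop for the LLM, with no audio and no change to the model architecture, which matches the lightweight setup the abstract describes.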

Subjects: Sound, Computation and Language, Machine Learning, Audio and Speech Processing

Publish: 2026-01-28 10:18:23 UTC

2601.20900
