The integration of large language models (LLMs) with ASR is increasingly explored, but it remains challenging for low-resource languages. Loose coupling via N-best lists fails due to high ASR error rates, while tight coupling using audio tokens requires too much data. A promising middle ground, SALSA, enables synchronous decoding by cascading ASR and LLM decoders via projection layers, overcoming their differing tokenizations. In this work, we show that SALSA fails when the ASR and LLM tokenizers have a large token fertility gap. This problem particularly plagues low-resource languages: the ASR tokenizer splits each LLM token into several ASR tokens, starving the LLM decoder of sufficient audio context. To address this, we propose SKIP-SALSA, which adaptively skips ahead, advancing the ASR decoder states to synchronize with the LLM; the skip size is learned via a lightweight skip predictor. SKIP-SALSA significantly improves ASR performance on multiple low-resource languages, yielding improvements of up to 20% over a strong baseline.
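To make the skip mechanism concrete, below is a minimal PyTorch sketch of one way a lightweight skip predictor could be realized: a small classifier over the current ASR and LLM decoder states that outputs how many ASR steps to advance before the next LLM step. The module name, dimensions, and the choice of a fixed maximum skip size are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SkipPredictor(nn.Module):
    """Hypothetical lightweight skip predictor (illustrative sketch).

    Given the current ASR and LLM decoder hidden states, predict how many
    ASR decoder steps to advance before the next LLM step, framed as a
    classification over skip sizes 1..max_skip.
    """

    def __init__(self, asr_dim: int, llm_dim: int, max_skip: int = 4):
        super().__init__()
        self.max_skip = max_skip
        # Small MLP over the concatenated states; layer sizes are arbitrary.
        self.mlp = nn.Sequential(
            nn.Linear(asr_dim + llm_dim, 256),
            nn.ReLU(),
            nn.Linear(256, max_skip),  # logits for skip sizes 1..max_skip
        )

    def forward(self, asr_state: torch.Tensor, llm_state: torch.Tensor) -> torch.Tensor:
        logits = self.mlp(torch.cat([asr_state, llm_state], dim=-1))
        # Skip size of 1 corresponds to the usual one-step-per-token advance;
        # larger values let the ASR decoder catch up to the LLM.
        return logits.argmax(dim=-1) + 1

# Usage sketch: advance the ASR decoder `skip` steps per LLM decoding step.
predictor = SkipPredictor(asr_dim=512, llm_dim=4096)
asr_hidden, llm_hidden = torch.randn(1, 512), torch.randn(1, 4096)
skip = predictor(asr_hidden, llm_hidden)  # e.g. tensor([3])
```

At decoding time, such a predictor would run once per LLM step, and the cascaded projection layers would then consume the ASR state reached after the predicted number of steps, keeping the two decoders synchronized despite the fertility gap.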