We present an end-to-end speech translation (ST) model that uses a large language model (LLM) to guide the translation process. Recent advances in LLMs have shown strong contextual understanding and robustness to noisy text, making them beneficial for mitigating automatic speech recognition (ASR) errors. Building on these strengths, we develop an LLM-driven ST model within an encoder-decoder framework, with the encoder handling an auxiliary ASR task and the decoder incorporating an LLM at its front end. Here, the encoder generates an ASR hypothesis that cues the LLM to perform machine translation. The LLM output is then fed into the decoder to yield the final translation. This two-pass design capitalizes on the LLM's robust and accurate translation capabilities, while enabling end-to-end optimization tailored to specific ST tasks. Experimental results on various ST tasks reveal significant performance gains with our LLM integration, and extensive analyses further validate our approach.
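The two-pass flow described above — encoder produces an ASR hypothesis, the LLM translates it, and the decoder yields the final translation — can be sketched as follows. This is a minimal illustrative stub, not the paper's implementation: all component names (`speech_encoder`, `llm_translate`, `st_decoder`) and their toy bodies are hypothetical placeholders standing in for trained neural modules.

```python
def speech_encoder(audio_features):
    # First pass: the encoder handles an auxiliary ASR task and
    # emits a transcript hypothesis (placeholder output here).
    return "hello world"

def llm_translate(asr_hypothesis):
    # The ASR hypothesis cues the LLM to perform machine translation;
    # a toy lookup stands in for the LLM call.
    toy_mt = {"hello world": "hallo welt"}
    return toy_mt.get(asr_hypothesis, asr_hypothesis)

def st_decoder(audio_features, llm_output):
    # Second pass: the decoder consumes the LLM output at its front end
    # (together with the speech representation) to produce the final
    # translation; the pass-through here marks where refinement happens.
    return llm_output

def two_pass_st(audio_features):
    hyp = speech_encoder(audio_features)      # ASR hypothesis
    mt = llm_translate(hyp)                   # LLM-guided translation
    return st_decoder(audio_features, mt)     # final end-to-end output

print(two_pass_st(None))
```

Because every stage is differentiable in the real model, this cascade-like structure can still be optimized end-to-end for the target ST task, which is the design's central advantage over a pure ASR-then-MT pipeline.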