pu25@interspeech_2025@ISCA

Total: 1

#1 Empowering Large Language Models for End-to-End Speech Translation Leveraging Synthetic Data

Authors: Yu Pu, Xiaoqian Liu, Guangyu Zhang, Zheng Yan, Wei-Qiang Zhang, Xie Chen

Speech-to-speech translation (S2ST) is a key technology for seamless cross-lingual communication. Traditional cascaded systems, which chain speech recognition, text translation, and speech synthesis, are prone to error propagation and latency. In this work, we present SLAM-TR, an end-to-end speech translation model that directly maps input speech to output speech, eliminating the need for intermediate text representations. Fine-tuned from the large language model Qwen2-0.5B, SLAM-TR outperforms the cascaded baseline and state-of-the-art open-source models with minimal training time. SLAM-TR also demonstrates strong generalization, achieving an ASR-BLEU score of 8.20 on the FLEURS benchmark and again surpassing both cascaded and open-source systems. Finally, to address the scarcity of natural speech translation data, we propose SynStard-1000, a 1,000-hour synthetic speech translation dataset.

Subject: INTERSPEECH.2025 - Language and Multimodal
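
The abstract reports results in ASR-BLEU, the standard metric for S2ST: the system's output speech is transcribed with an off-the-shelf ASR model, and BLEU is computed between those transcripts and the reference translations. The sketch below illustrates that scoring procedure only; the Whisper checkpoint, file names, and helper function are illustrative assumptions and are not taken from the paper.

```python
# Minimal ASR-BLEU scoring sketch (assumes transformers and sacrebleu are installed).
from transformers import pipeline
import sacrebleu

# ASR model used only for scoring; any sufficiently strong ASR system can stand in.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def asr_bleu(output_wavs, reference_texts):
    """Transcribe each output waveform file and score the transcripts against references."""
    hypotheses = [asr(wav)["text"].strip() for wav in output_wavs]
    # sacrebleu expects a list of hypothesis strings and a list of reference lists.
    return sacrebleu.corpus_bleu(hypotheses, [reference_texts]).score

# Hypothetical usage with placeholder file names and references:
# score = asr_bleu(["utt1.wav", "utt2.wav"], ["Hello world.", "Good morning."])
# print(f"ASR-BLEU: {score:.2f}")
```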