Speech-to-speech translation (S2ST) is a key technology for seamless cross-lingual communication. Traditional cascaded systems, which chain speech recognition, text translation, and speech synthesis, are prone to error propagation and latency. In this work, we present SLAM-TR, an end-to-end speech translation model that directly maps input speech to output speech, eliminating the need for intermediate text representations. Fine-tuned from the large language model Qwen2-0.5B, SLAM-TR outperforms both the cascaded baseline and state-of-the-art open-source models with minimal training time. SLAM-TR also demonstrates strong generalization, achieving an ASR-BLEU score of 8.20 on the FLEURS benchmark, again surpassing both cascaded and open-source systems. Finally, to address the scarcity of natural speech translation data, we propose SynStard-1000, a 1,000-hour synthetic speech translation dataset.