ou25@interspeech_2025@ISCA

#1 CMSP-ST: Cross-modal Mixup with Speech Purification for End-to-End Speech Translation

Authors: Jiale Ou, Hongying Zan

End-to-end speech translation (E2E ST) aims to directly convert speech in a source language into text in a target language, but its performance is constrained by the inherent modality gap between speech and text. Existing methods align speech and text representations to perform cross-modal mixup at the token level, but this overlooks the impact of redundant speech information. In this paper, we propose cross-modal mixup with speech purification for speech translation (CMSP-ST) to address this issue. Specifically, we remove non-content features from speech through orthogonal projection and use the purified speech features for cross-modal mixup. Additionally, we employ adversarial training under Soft Alignment (S-Align) to relax the alignment granularity and improve robustness. Experimental results on the MuST-C En-De, CoVoST-2 Fr-En, and CoVoST-2 De-En benchmarks demonstrate that CMSP-ST improves the speech translation performance of existing cross-modal mixup methods.
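
The two operations named in the abstract, orthogonal projection and token-level cross-modal mixup, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how the non-content directions are obtained or how mixup positions are chosen, so the sketch assumes the non-content directions are given as a matrix, that speech and text are already aligned one vector per token, and that each position is swapped independently with a fixed probability. The names `purify`, `token_mixup`, `noncontent`, and `p` are hypothetical.

```python
import numpy as np

def purify(speech, noncontent):
    """Remove from each speech vector its component in span(noncontent).

    speech:     (T, d) array, one feature vector per token (assumed aligned).
    noncontent: (k, d) array of linearly independent non-content directions
                (assumed given; how to estimate them is not covered here).
    """
    # Orthonormal basis of the non-content subspace, shape (d, k).
    q, _ = np.linalg.qr(noncontent.T)
    # Subtract the projection onto that subspace; the remainder is orthogonal to it.
    return speech - (speech @ q) @ q.T

def token_mixup(speech_tok, text_tok, p, rng):
    """Per token, keep the speech vector with probability p, else the text vector."""
    keep_speech = rng.random(speech_tok.shape[0]) < p
    mixed = np.where(keep_speech[:, None], speech_tok, text_tok)
    return mixed, keep_speech

rng = np.random.default_rng(0)
speech = rng.normal(size=(5, 8))      # toy speech features, 5 tokens
text = rng.normal(size=(5, 8))        # toy text embeddings, same length
noncontent = rng.normal(size=(2, 8))  # toy non-content directions

pure = purify(speech, noncontent)
mixed, mask = token_mixup(pure, text, p=0.5, rng=rng)
```

After `purify`, every row of `pure` has zero inner product with each row of `noncontent` (up to floating-point error), which is the sense in which the non-content component is removed in this sketch.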

Subject: INTERSPEECH.2025 - Language and Multimodal