2024.iwslt-1.22@ACL

#1 CMU’s IWSLT 2024 Offline Speech Translation System: A Cascaded Approach For Long-Form Robustness

Authors: Brian Yan; Patrick Fernandes; Jinchuan Tian; Siqi Ouyang; William Chen; Karen Livescu; Lei Li; Graham Neubig; Shinji Watanabe

This work describes CMU’s submission to the IWSLT 2024 Offline Speech Translation (ST) Shared Task for translating English speech to German, Chinese, and Japanese text. We are the first participants to employ a long-form strategy that directly processes unsegmented recordings without a separate voice activity detection (VAD) stage. We show that the Whisper automatic speech recognition (ASR) model hallucinates when applied out-of-the-box to recordings containing non-speech noises, but that a simple noisy fine-tuning approach greatly improves Whisper’s long-form robustness across multiple domains. We then feed English ASR outputs into fine-tuned NLLB machine translation (MT) models, which are decoded using COMET-based Minimum Bayes Risk (MBR). Our VAD-free ASR+MT cascade is tested on TED talks, TV series, and workout videos, and is shown to outperform prior winning IWSLT submissions and large open-source models.
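The MBR decoding step mentioned above selects, from a set of MT candidates, the hypothesis with the highest expected utility against all other candidates treated as pseudo-references. The sketch below illustrates the idea; note that the paper uses COMET as the utility metric, whereas here a hypothetical token-overlap F1 stands in so the example is self-contained.

```python
# Minimal sketch of Minimum Bayes Risk (MBR) decoding over MT candidates.
# The submission uses COMET as the utility function; overlap_f1 below is a
# placeholder assumption so this example runs without external models.

def overlap_f1(hyp: str, ref: str) -> float:
    """Token-overlap F1 between two strings (stand-in utility)."""
    h, r = set(hyp.split()), set(ref.split())
    common = len(h & r)
    if common == 0:
        return 0.0
    precision, recall = common / len(h), common / len(r)
    return 2 * precision * recall / (precision + recall)

def mbr_decode(candidates, utility=overlap_f1):
    """Return the candidate with the highest average utility when
    scored against every other candidate (pseudo-references)."""
    best, best_score = None, float("-inf")
    for hyp in candidates:
        score = sum(utility(hyp, ref) for ref in candidates if ref is not hyp)
        score /= max(len(candidates) - 1, 1)
        if score > best_score:
            best, best_score = hyp, score
    return best

cands = ["das ist ein test", "dies ist ein test", "das war kein test"]
print(mbr_decode(cands))  # → "das ist ein test" (most consensus support)
```

In practice the candidate list comes from sampling or beam search over the NLLB model, and the quadratic number of utility calls is the main cost of MBR at scale.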