IWSLT.2024

Total: 37

#1 FINDINGS OF THE IWSLT 2024 EVALUATION CAMPAIGN

Authors: Ibrahim Said Ahmad ; Antonios Anastasopoulos ; Ondřej Bojar ; Claudia Borg ; Marine Carpuat ; Roldano Cattoni ; Mauro Cettolo ; William Chen ; Qianqian Dong ; Marcello Federico ; Barry Haddow ; Dávid Javorský ; Mateusz Krubiński ; Tsz Kin Lam ; Xutai Ma ; Prashant Mathur ; Evgeny Matusov ; Chandresh Maurya ; John McCrae ; Kenton Murray ; Satoshi Nakamura ; Matteo Negri ; Jan Niehues ; Xing Niu ; Atul Kr. Ojha ; John Ortega ; Sara Papi ; Peter Polák ; Adam Pospíšil ; Pavel Pecina ; Elizabeth Salesky ; Nivedita Sethiya ; Balaram Sarkar ; Jiatong Shi ; Claytone Sikasote ; Matthias Sperber ; Sebastian Stüker ; Katsuhito Sudoh ; Brian Thompson ; Alex Waibel ; Shinji Watanabe ; Patrick Wilken ; Petr Zemánek ; Rodolfo Zevallos

This paper reports on the shared tasks organized by the 21st IWSLT Conference. The shared tasks address 7 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, dialect and low-resource speech translation, and Indic languages. The shared tasks attracted 17 teams whose submissions are documented in 27 system papers. The growing interest towards spoken language translation is also witnessed by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly distributed across industry and academia.

#2 Pause-Aware Automatic Dubbing using LLM and Voice Cloning

Authors: Yuang Li ; Jiaxin Guo ; Min Zhang ; Ma Miaomiao ; Zhiqiang Rao ; Weidong Zhang ; Xianghui He ; Daimeng Wei ; Hao Yang

Automatic dubbing aims to translate the speech of a video into another language, ensuring the new speech naturally fits the original video. This paper details Huawei Translation Services Center’s (HW-TSC) submission for IWSLT 2024’s automatic dubbing task, under an unconstrained setting. Our system’s machine translation (MT) component utilizes a Transformer-based MT model and an LLM-based post-editor to produce translations of varying lengths. The text-to-speech (TTS) component employs a VITS-based TTS model and a voice cloning module to emulate the original speaker’s vocal timbre. For enhanced dubbing synchrony, we introduce a parsing-informed pause selector. Finally, we rerank multiple results based on lip-sync error distance (LSE-D) and character error rate (CER). Our system achieves LSE-D scores of 10.75 and 12.19 on subset1 and subset2 of the DE-EN test sets, respectively, superior to last year’s best system.
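The final reranking step can be illustrated with a minimal sketch. The abstract does not say how LSE-D and CER are combined, so the weighted sum, the `cer_weight` value, and the `Candidate` type below are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    lse_d: float  # lip-sync error distance, lower is better
    cer: float    # character error rate in [0, 1], lower is better


def rerank(candidates, cer_weight=10.0):
    """Sort dubbing candidates from best to worst.

    The score is a weighted sum of LSE-D and CER. This combination rule
    and the default weight are hypothetical, not taken from the paper.
    """
    return sorted(candidates, key=lambda c: c.lse_d + cer_weight * c.cer)


candidates = [
    Candidate("variant a", lse_d=11.0, cer=0.05),
    Candidate("variant b", lse_d=10.5, cer=0.20),
    Candidate("variant c", lse_d=10.8, cer=0.02),
]
best = rerank(candidates)[0]
```

With these made-up numbers, the candidate with the lowest LSE-D is not selected because its CER penalty outweighs the lip-sync gain.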

#3 NICT’s Cascaded and End-To-End Speech Translation Systems using Whisper and IndicTrans2 for the Indic Task

Authors: Raj Dabre ; Haiyue Song

This paper presents NICT’s submission for the IWSLT 2024 Indic track, focusing on three speech-to-text (ST) translation directions: English to Hindi, Bengali, and Tamil. We aim to enhance translation quality in this low-resource scenario by integrating state-of-the-art pre-trained automatic speech recognition (ASR) and text-to-text machine translation (MT) models. Our cascade system incorporates a Whisper model fine-tuned for ASR and an IndicTrans2 model fine-tuned for MT. Additionally, we propose an end-to-end system that combines a Whisper model for speech-to-text conversion with knowledge distilled from an IndicTrans2 MT model. We first fine-tune the IndicTrans2 model to generate pseudo data in Indic languages. This pseudo data, along with the original English speech data, is then used to fine-tune the Whisper model. Experimental results show that the cascaded system achieved a BLEU score of 51.0, outperforming the end-to-end model, which scored 19.1 BLEU. Moreover, the analysis indicates that applying knowledge distillation from the IndicTrans2 model to the end-to-end ST model improves the translation quality by about 0.7 BLEU.

#4 Transforming LLMs into Cross-modal and Cross-lingual Retrieval Systems

Authors: Frank Palma Gomez ; Ramon Sanabria ; Yun-hsuan Sung ; Daniel Cer ; Siddharth Dalmia ; Gustavo Hernandez Abrego

Large language models (LLMs) are trained on text-only data that go far beyond the languages with paired speech and text data. At the same time, Dual Encoder (DE) based retrieval systems project queries and documents into the same embedding space and have demonstrated their success in retrieval and bi-text mining. To match speech and text in many languages, we propose using LLMs to initialize multi-modal DE retrieval systems. Unlike traditional methods, our system doesn’t require speech data during LLM pre-training and can exploit LLM’s multilingual text understanding capabilities to match speech and text in languages unseen during retrieval training. Our multi-modal LLM-based retrieval system is capable of matching speech and text in 102 languages despite only training on 21 languages. Our system outperforms previous systems trained explicitly on all 102 languages. We achieve a 10% absolute improvement in Recall@1 averaged across these languages. Additionally, our model demonstrates cross-lingual speech and text matching, which is further enhanced by readily available machine translation data.
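As a rough illustration of how Recall@1 is measured for a dual-encoder retrieval system, the sketch below scores each speech embedding against all text embeddings by cosine similarity and counts how often the top match is the paired item. The embeddings are toy values, the function names are hypothetical, and the pairing convention (query `i` matches document `i`) is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(query_embs, doc_embs):
    """Fraction of queries whose nearest document is the paired one.

    Assumes query i is paired with document i.
    """
    hits = 0
    for i, query in enumerate(query_embs):
        best = max(range(len(doc_embs)), key=lambda j: cosine(query, doc_embs[j]))
        hits += int(best == i)
    return hits / len(query_embs)


# Toy embeddings standing in for encoder outputs in a shared space.
speech = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
score = recall_at_1(speech, text)
```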

#5 Conditioning LLMs with Emotion in Neural Machine Translation

Authors: Charles Brazier ; Jean-Luc Rouas

Large Language Models (LLMs) have shown remarkable performance in Natural Language Processing tasks, including Machine Translation (MT). In this work, we propose a novel MT pipeline that integrates emotion information extracted from a Speech Emotion Recognition (SER) model into LLMs to enhance translation quality. We first fine-tune five existing LLMs on the Libri-trans dataset and select the most performant model. Subsequently, we augment LLM prompts with different dimensional emotions and train the selected LLM under these different configurations. Our experiments reveal that integrating emotion information, especially arousal, into LLM prompts leads to notable improvements in translation quality.

#6 The NYA’s Offline Speech Translation System for IWSLT 2024

Authors: Yingxin Zhang ; Guodong Ma ; Binbin Du

This paper reports NYA’s submissions to the IWSLT 2024 Offline Speech Translation (ST) task, covering the English-to-Chinese, English-to-Japanese, and English-to-German sub-tasks. We participate in the unconstrained training track using a cascaded ST structure. For the automatic speech recognition (ASR) model, we use the Whisper large-v3 model. For the neural machine translation (NMT) model, a wider and deeper Transformer is adopted as the backbone model. Furthermore, we use data augmentation techniques to enlarge the training data and data filtering strategies to improve its quality. In addition, we explore several MT techniques such as Back Translation, Forward Translation, R-Drop, and Domain Adaptation.

#7 Improving the Quality of IWLST 2024 Cascade Offline Speech Translation and Speech-to-Speech Translation via Translation Hypothesis Ensembling with NMT models and Large Language Models

Authors: Zhanglin Wu ; Jiaxin Guo ; Daimeng Wei ; Zhiqiang Rao ; Zongyao Li ; Hengchao Shang ; Yuanchang Luo ; Shaojun Li ; Hao Yang

This paper presents HW-TSC’s submission to the IWSLT 2024 Offline Speech Translation Task and Speech-to-Speech Translation Task. The former includes three translation directions: English to German, English to Chinese, and English to Japanese, while the latter only includes the translation direction of English to Chinese. We participate in all three tracks (Constrained training, Constrained with Large Language Models training, and Unconstrained training) of the offline speech translation task, using the cascade model architecture. Under the constrained training track, we train an ASR model from scratch, and then employ R-Drop and domain data selection to train the NMT model. In the constrained with Large Language Models training track, we use Wav2vec 2.0 and mBART50 for ASR model training initialization, and then train the Llama2-7B-based MT model using continuous training with sentence-aligned parallel data, supervised fine-tuning, and contrastive preference optimization. In the unconstrained training track, we fine-tune the Whisper model for speech recognition, and then ensemble the translation results of NMT models and LLMs to produce superior translation output. For the speech-to-speech translation task, we first employ the offline speech translation system described above to generate the translated text. Then, we utilize the VITS model to generate the corresponding speech and employ the OpenVoice model for timbre cloning.

#8 HW-TSC’s Speech to Text Translation System for IWSLT 2024 in Indic track

Authors: Bin Wei ; Zongyao Li ; Jiaxin Guo ; Daimeng Wei ; Zhanglin Wu ; Xiaoyu Chen ; Zhiqiang Rao ; Shaojun Li ; Yuanchang Luo ; Hengchao Shang ; Hao Yang ; Yanfei Jiang

This article describes HW-TSC’s system and results for the IWSLT 2024 Indic Track Speech to Text Translation task. We designed a cascade system consisting of an ASR model and a machine translation model to translate speech from one language to another. For the ASR part, we directly use Whisper large-v3 as our ASR model. Our main task is to optimize the machine translation model (en2ta, en2hi, en2bn). In the process of optimizing the translation model, we first use a bilingual corpus to train the baseline model. Then we use monolingual data to construct pseudo-corpus data to further enhance the baseline model. Finally, we filter the parallel corpus data with LaBSE-based filtering and fine-tune the model again, which further improves the BLEU score. We also selected domain data from the bilingual corpus to fine-tune the previous model to achieve the best results.

#9 Multi-Model System for Effective Subtitling Compression

Authors: Carol-Luca Gasan ; Vasile Păiș

This paper presents RACAI’s system used for the shared task of ‘Subtitling track: Subtitle Compression’ (the English to Spanish language direction), organized as part of ‘the 21st edition of The International Conference on Spoken Language Translation (IWSLT 2024)’. The proposed system consists of multiple models whose outputs are then ensembled using an algorithm, which has the purpose of maximizing the similarity of the initial and resulting text. We present the introduced datasets and the models’ training strategy, along with the reported results on the proposed test set.

#10 FBK@IWSLT Test Suites Task: Gender Bias evaluation with MuST-SHE

Authors: Beatrice Savoldi ; Marco Gaido ; Matteo Negri ; Luisa Bentivogli

This paper presents the FBK contribution to the IWSLT-2024 ‘Test suites’ shared subtask, part of the Offline Speech Translation Task. Our contribution consists of the MuST-SHE-IWSLT24 benchmark evaluation, designed to assess gender bias in speech translation. By focusing on the en-de language pair, we rely on a newly created test suite to investigate systems’ ability to correctly translate feminine and masculine gender. Our results indicate that – under realistic conditions – current ST systems achieve reasonable and comparable performance in correctly translating both feminine and masculine forms when contextual gender information is available. For ambiguous references to the speaker, however, we attest a consistent preference towards masculine gender, thus calling for future endeavours on the topic. Towards this goal we make MuST-SHE-IWSLT24 freely available at: https://mt.fbk.eu/must-she/

#11 SimulSeamless: FBK at IWSLT 2024 Simultaneous Speech Translation

Authors: Sara Papi ; Marco Gaido ; Matteo Negri ; Luisa Bentivogli

This paper describes FBK’s participation in the Simultaneous Translation Evaluation Campaign at IWSLT 2024. For this year’s submission in the speech-to-text translation (ST) sub-track, we propose SimulSeamless, which is realized by combining AlignAtt and SeamlessM4T in its medium configuration. The SeamlessM4T model is used ‘off-the-shelf’ and its simultaneous inference is enabled through the adoption of AlignAtt, a SimulST policy based on cross-attention that can be applied without any retraining or adaptation of the underlying model for the simultaneous task. We participated in all the Shared Task languages (English->German, Japanese, Chinese, and Czech->English), achieving acceptable or even better results compared to last year’s submissions. SimulSeamless, covering more than 143 source languages and 200 target languages, is released at: https://github.com/hlt-mt/FBK-fairseq/.

#12 The SETU-DCU Submissions to IWSLT 2024 Low-Resource Speech-to-Text Translation Tasks

Authors: Maria Zafar ; Antonio Castaldo ; Prashanth Nayak ; Rejwanul Haque ; Neha Gajakos ; Andy Way

Natural Language Processing (NLP) research and development has progressed rapidly in recent times due to advances in deep learning. The introduction of pre-trained large language models (LLMs) is at the core of this transformation, significantly enhancing the performance of machine translation (MT) and speech technologies. This development has also led to fundamental changes in modern translation and speech tools and their methodologies. However, there remain challenges when extending this progress to underrepresented dialects and low-resource languages, primarily due to the need for more data. This paper details our submissions to the IWSLT speech translation (ST) tasks. We used the Whisper model for the automatic speech recognition (ASR) component. We then used mBART and NLLB as the MT components of our cascaded systems. Our research primarily focused on exploring various dialects of low-resource languages and harnessing existing resources from linguistically related languages. We conducted our experiments for two morphologically diverse language pairs: Irish-to-English and Maltese-to-English. We used BLEU, chrF and COMET for evaluating our MT models.

#13 Automatic Subtitling and Subtitle Compression: FBK at the IWSLT 2024 Subtitling track

Authors: Marco Gaido ; Sara Papi ; Mauro Cettolo ; Roldano Cattoni ; Andrea Piergentili ; Matteo Negri ; Luisa Bentivogli

The paper describes the FBK submissions to the Subtitling track of the 2024 IWSLT Evaluation Campaign, which covers both the Automatic Subtitling and the Subtitle Compression task for two language pairs: English to German (en-de) and English to Spanish (en-es). For the Automatic Subtitling task, we submitted two systems: i) a direct model, trained in constrained conditions, that produces the SRT files from the audio without intermediate outputs (e.g., transcripts), and ii) a cascade solution that integrates only free-to-use components, either taken off-the-shelf or developed in-house. Results show that, on both language pairs, our direct model outperforms both cascade and direct systems trained in constrained conditions in last year’s edition of the campaign, while our cascade solution is competitive with the best 2023 runs. For the Subtitle Compression task, our primary submission involved prompting a Large Language Model (LLM) in zero-shot mode to shorten subtitles that exceed the reading speed limit of 21 characters per second. Our results highlight the challenges inherent in shrinking out-of-context sentence fragments that are automatically generated and potentially error-prone, underscoring the need for future studies to develop targeted solutions.
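The reading-speed constraint mentioned above can be checked with a few lines of code. This is a minimal sketch: the function names are hypothetical, and counting every character (including spaces) toward the rate is an assumption, since the abstract does not define how characters are counted.

```python
def chars_per_second(text, start_s, end_s):
    """Reading speed of a subtitle block in characters per second.

    All characters, including spaces, are counted (an assumption).
    """
    duration = end_s - start_s
    if duration <= 0:
        raise ValueError("subtitle must have a positive duration")
    return len(text) / duration


def exceeds_reading_speed(text, start_s, end_s, limit_cps=21.0):
    """True if the subtitle is too fast to read and needs compression."""
    return chars_per_second(text, start_s, end_s) > limit_cps


too_fast = exceeds_reading_speed("x" * 50, start_s=10.0, end_s=12.0)   # 25 cps
readable = exceeds_reading_speed("x" * 40, start_s=10.0, end_s=12.0)   # 20 cps
```

Only subtitles flagged by such a check would be candidates for shortening.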

#14 UM IWSLT 2024 Low-Resource Speech Translation: Combining Maltese and North Levantine Arabic

Authors: Sara Nabhani ; Aiden Williams ; Miftahul Jannat ; Kate Rebecca Belcher ; Melanie Galea ; Anna Taylor ; Kurt Micallef ; Claudia Borg

The IWSLT low-resource track encourages innovation in the field of speech translation, particularly in data-scarce conditions. This paper details our submission for the IWSLT 2024 low-resource track shared task for Maltese-English and North Levantine Arabic-English spoken language translation using an unconstrained pipeline approach. Using language models, we improve ASR performance by correcting the produced output. We present a two-step approach for MT using data from external sources, showing improvements over baseline systems. We also explore transliteration as a means to further augment MT data and exploit the cross-lingual similarities between Maltese and Arabic.

#15 UOM-Constrained IWSLT 2024 Shared Task Submission - Maltese Speech Translation

Authors: Kurt Abela ; Md Abdur Razzaq Riyadh ; Melanie Galea ; Alana Busuttil ; Roman Kovalev ; Aiden Williams ; Claudia Borg

This paper presents our IWSLT-2024 shared task submission on the low-resource track. This submission forms part of the constrained setup, implying limited data for training. Following the introduction, this paper consists of a literature review defining previous approaches to speech translation, as well as their application to Maltese, followed by the defined methodology, evaluation and results, and the conclusion. A cascaded submission on the Maltese to English language pair is presented, consisting of a pipeline containing a DeepSpeech 1 Automatic Speech Recognition (ASR) system, a KenLM model to optimise the transcriptions, and finally an LSTM machine translation model. The submission achieves a 0.5 BLEU score on the overall test set, and the ASR system achieves a word error rate of 97.15%. Our code is made publicly available.
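For readers unfamiliar with the metric, word error rate is the word-level edit distance between the ASR hypothesis and the reference, divided by the number of reference words. The sketch below is a generic implementation with whitespace tokenization (an assumption); it is not the evaluation script used in the paper, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors, the rate can exceed 100% when the hypothesis is much longer than the reference.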

#16 Compact Speech Translation Models via Discrete Speech Units Pretraining

Authors: Tsz Kin Lam ; Alexandra Birch ; Barry Haddow

We propose a pretraining method that uses a Self-Supervised Speech (SSS) model to create more compact Speech-to-text Translation models. In contrast to using the SSS model for initialization, our method is more suitable for memory-constrained scenarios such as on-device deployment. Our method is based on Discrete Speech Units (DSU) extracted from the SSS model. In the first step, our method pretrains two smaller encoder-decoder models on 1) Filterbank-to-DSU (Fbk-to-DSU) and 2) DSU-to-Translation (DSU-to-Trl) data respectively. The DSU thus become the distillation inputs of the smaller models. Subsequently, the encoder from the Fbk-to-DSU model and the decoder from the DSU-to-Trl model are taken to initialise the compact model. Finally, the compact model is finetuned on the paired Fbk-Trl data. In addition to being compact, our method requires no transcripts, making it applicable to low-resource settings. It also avoids speech discretization in inference and is more robust to the DSU tokenization. Evaluation on CoVoST-2 (X-En) shows that our method has consistent improvement over the baseline in three metrics while being compact, i.e., only half the SSS model size.

#17 QUESPA Submission for the IWSLT 2024 Dialectal and Low-resource Speech Translation Task

Authors: John E. Ortega ; Rodolfo Joel Zevallos ; Ibrahim Said Ahmad ; William Chen

This article describes the QUESPA team speech translation (ST) submissions for the Quechua to Spanish (QUE–SPA) track featured in the Evaluation Campaign of IWSLT 2024: dialectal and low-resource speech translation. Two main submission types were supported in the campaign: constrained and unconstrained. This is our second year submitting our ST systems to the IWSLT shared task and we feel that we have achieved novel performance, surpassing last year’s submissions. Again, we were able to submit six total systems of which our best (primary) constrained system consisted of an ST model based on the Fairseq S2T framework where the audio representations were created using log mel-scale filter banks as features and the translations were performed using a transformer. The system was similar to last year’s submission with slight configuration changes, allowing us to achieve slightly higher performance (2 BLEU). In contrast, we were able to achieve much better performance than last year on the unconstrained task using a larger pre-trained language model (PLM) for ST (without cascading) and the inclusion of parallel QUE–SPA data found on the internet. The fine-tuning of Microsoft’s SpeechT5 model in an ST setting along with the addition of new data and a data augmentation technique allowed us to achieve 19.7 BLEU. Additionally, we present the other four submissions (2 constrained and 2 unconstrained) which are part of additional efforts of hyper-parameter and configuration tuning on existing models and the inclusion of Whisper for speech recognition.

#18 Speech Data from Radio Broadcasts for Low Resource Languages

Authors: Bismarck Bamfo Odoom ; Leibny Paola Garcia Perera ; Prangthip Hansanti ; Loic Barrault ; Christophe Ropers ; Matthew Wiesner ; Kenton Murray ; Alexandre Mourachko ; Philipp Koehn

We created a collection of speech data for 48 low resource languages. The corpus is extracted from radio broadcasts and processed with novel speech detection and language identification models based on a manually vetted subset of the audio for 10 languages. The data is made publicly available.

#19 JHU IWSLT 2024 Dialectal and Low-resource System Description

Authors: Nathaniel Romney Robinson ; Kaiser Sun ; Cihan Xiao ; Niyati Bafna ; Weiting Tan ; Haoran Xu ; Henry Li Xinyuan ; Ankur Kejriwal ; Sanjeev Khudanpur ; Kenton Murray ; Paul McNamee

Johns Hopkins University (JHU) submitted systems for all eight language pairs in the 2024 Low-Resource Language Track. The main effort of this work revolves around fine-tuning large and publicly available models in three proposed systems: i) end-to-end speech translation (ST) fine-tuning of SeamlessM4T v2; ii) ST fine-tuning of Whisper; iii) a cascaded system involving automatic speech recognition with fine-tuned Whisper and machine translation with NLLB. On top of the systems above, we conduct a comparative analysis on different training paradigms, such as intra-distillation for NLLB as well as joint training and curriculum learning for SeamlessM4T v2. Our results show that the best-performing approach differs by language pairs, but that i) fine-tuned SeamlessM4T v2 tends to perform best for source languages on which it was pre-trained, ii) multi-task training helps Whisper fine-tuning, iii) cascaded systems with Whisper and NLLB tend to outperform Whisper alone, and iv) intra-distillation helps NLLB fine-tuning.

#20 CMU’s IWSLT 2024 Simultaneous Speech Translation System

Authors: Xi Xu ; Siqi Ouyang ; Brian Yan ; Patrick Fernandes ; William Chen ; Lei Li ; Graham Neubig ; Shinji Watanabe

This paper describes CMU’s submission to the IWSLT 2024 Simultaneous Speech Translation (SST) task for translating English speech to German text in a streaming manner. Our end-to-end speech-to-text (ST) system integrates the WavLM speech encoder, a modality adapter, and the Llama2-7B-Base model as the decoder. We employ a two-stage training approach: initially, we align the representations of speech and text, followed by full fine-tuning. Both stages are trained on MuST-C v2 data with cross-entropy loss. We adapt our offline ST model for SST using a simple fixed hold-n policy. Experiments show that our model obtains an offline BLEU score of 31.1 and a BLEU score of 29.5 under 2 seconds latency on MuST-C v2 tst-COMMON.
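A fixed hold-n policy can be sketched as follows: each time more audio arrives, the model produces a full hypothesis for the audio seen so far, the last n tokens are withheld as unstable, and anything before them that has not yet been emitted is committed. The function name is hypothetical, and the sketch assumes each new hypothesis extends the already committed prefix, which a real system would enforce by forced decoding.

```python
def hold_n_commit(hypothesis, committed, n):
    """Return the tokens to emit at this step under a hold-n policy.

    hypothesis: tokens decoded from all audio received so far
    committed:  tokens already emitted (assumed to be a prefix of hypothesis)
    n:          number of trailing tokens to withhold
    """
    stable_length = max(len(hypothesis) - n, 0)
    if stable_length <= len(committed):
        return []
    return hypothesis[len(committed):stable_length]


committed = []
step1 = hold_n_commit(["Ich", "bin", "ein", "Student"], committed, n=2)
committed += step1
step2 = hold_n_commit(
    ["Ich", "bin", "ein", "Student", "aus", "Berlin"], committed, n=2
)
committed += step2
```

A larger n delays output but lowers the risk of committing tokens that later audio would have changed.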

#21 HW-TSC’s Submissions To the IWSLT2024 Low-resource Speech Translation Tasks

Authors: Zheng Jiawei ; Hengchao Shang ; Zongyao Li ; Zhanglin Wu ; Daimeng Wei ; Zhiqiang Rao ; Shaojun Li ; Jiaxin Guo ; Bin Wei ; Yuanchang Luo ; Hao Yang

In this work, we submitted our systems to the low-resource track of the IWSLT 2024 Speech Translation Campaign. Our systems tackled the unconstrained condition of the Dialectal Arabic North Levantine (ISO-3 code: apc) to English language pair. We proposed a cascaded solution consisting of an automatic speech recognition (ASR) model and a machine translation (MT) model. The ASR model employed the pre-trained Whisper-large-v3 model to process the speech data, while the MT model adopted the Transformer architecture. To improve the quality of the MT model, our system utilized not only the data provided by the competition but also an additional 54 million parallel sentences. Our final system achieved a BLEU score of 24.7 for apc-to-English translation.

#22 CMU’s IWSLT 2024 Offline Speech Translation System: A Cascaded Approach For Long-Form Robustness

Authors: Brian Yan ; Patrick Fernandes ; Jinchuan Tian ; Siqi Ouyang ; William Chen ; Karen Livescu ; Lei Li ; Graham Neubig ; Shinji Watanabe

This work describes CMU’s submission to the IWSLT 2024 Offline Speech Translation (ST) Shared Task for translating English speech to German, Chinese, and Japanese text. We are the first participants to employ a long-form strategy which directly processes unsegmented recordings without the need for a separate voice-activity detection (VAD) stage. We show that the Whisper automatic speech recognition (ASR) model has a hallucination problem when applied out-of-the-box to recordings containing non-speech noises, but a simple noisy fine-tuning approach can greatly enhance Whisper’s long-form robustness across multiple domains. Then, we feed English ASR outputs into fine-tuned NLLB machine translation (MT) models which are decoded using COMET-based Minimum Bayes Risk. Our VAD-free ASR+MT cascade is tested on TED talks, TV series, and workout videos and shown to outperform prior winning IWSLT submissions and large open-source models.
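Minimum Bayes Risk decoding picks, from a set of sampled translations, the one with the highest average utility when every other candidate is treated as a pseudo-reference. The sketch below shows the selection logic only. The `token_overlap` utility is a toy stand-in chosen so the example runs without a model; the system described above uses COMET as the utility, and the function names here are hypothetical.

```python
def token_overlap(hypothesis, reference):
    """Toy utility: Jaccard overlap of word sets (stand-in for COMET)."""
    hyp_words = set(hypothesis.split())
    ref_words = set(reference.split())
    union = hyp_words | ref_words
    return len(hyp_words & ref_words) / len(union) if union else 1.0


def mbr_select(candidates, utility):
    """Return the candidate with the highest mean utility against the others."""
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best_index = max(range(len(candidates)), key=expected_utility)
    return candidates[best_index]


candidates = ["the cat sat", "the cat sat down", "the cat sits", "a dog ran"]
choice = mbr_select(candidates, token_overlap)
```

The outlier candidate scores zero against all others, so the selection favors the translation closest to the consensus.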

#23 NAIST Simultaneous Speech Translation System for IWSLT 2024

Authors: Yuka Ko ; Ryo Fukuda ; Yuta Nishikawa ; Yasumasa Kano ; Tomoya Yanagita ; Kosuke Doi ; Mana Makinae ; Haotian Tan ; Makoto Sakai ; Sakriani Sakti ; Katsuhito Sudoh ; Satoshi Nakamura

This paper describes NAIST’s submission to the simultaneous track of the IWSLT 2024 Evaluation Campaign: English-to-German, Japanese, Chinese speech-to-text translation and English-to-Japanese speech-to-speech translation. We develop a multilingual end-to-end speech-to-text translation model combining two pre-trained language models, HuBERT and mBART. We trained this model with two decoding policies, Local Agreement (LA) and AlignAtt. The submitted models employ the LA policy because it outperformed the AlignAtt policy in previous models. Our speech-to-speech translation method is a cascade of the above speech-to-text model and an incremental text-to-speech (TTS) module that incorporates a phoneme estimation model, a parallel acoustic model, and a parallel WaveGAN vocoder. We improved our incremental TTS by applying the Transformer architecture with the AlignAtt policy for the estimation model. The results show that our upgraded TTS module contributed to improving the system performance.
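The Local Agreement policy can be illustrated with a short sketch: the hypotheses produced for two consecutive audio chunks are compared, and only their longest common prefix is treated as stable and emitted. The function name and the example tokens are hypothetical, and this is a generic illustration of the policy rather than the submitted system.

```python
def local_agreement(prev_hypothesis, curr_hypothesis, committed):
    """Return new tokens to emit: the agreed prefix beyond what was committed.

    prev_hypothesis: tokens decoded after the previous audio chunk
    curr_hypothesis: tokens decoded after the current audio chunk
    committed:       tokens already emitted
    """
    agreed = []
    for prev_token, curr_token in zip(prev_hypothesis, curr_hypothesis):
        if prev_token != curr_token:
            break
        agreed.append(curr_token)
    return agreed[len(committed):]


committed = []
first = local_agreement(
    ["I", "am", "a"], ["I", "am", "the", "student"], committed
)
committed += first
second = local_agreement(
    ["I", "am", "the", "student"],
    ["I", "am", "the", "student", "from"],
    committed,
)
committed += second
```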

#24 Blending LLMs into Cascaded Speech Translation: KIT’s Offline Speech Translation System for IWSLT 2024

Authors: Sai Koneru ; Thai Binh Nguyen ; Ngoc-Quan Pham ; Danni Liu ; Zhaolin Li ; Alexander Waibel ; Jan Niehues

Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST). In this paper, we present KIT’s offline submission in the constrained + LLM track by incorporating recently proposed techniques that can be added to any cascaded speech translation system. Specifically, we integrate Mistral-7B into our system to enhance it in two ways. Firstly, we refine the ASR outputs by utilizing the N-best lists generated by our system and fine-tuning the LLM to predict the transcript accurately. Secondly, we refine the MT outputs at the document level by fine-tuning the LLM, leveraging both ASR and MT predictions to improve translation quality. We find that integrating the LLM into the ASR and MT systems results in an absolute improvement of 0.3% in Word Error Rate and 0.65% in COMET on the tst2019 test set. In challenging test sets with overlapping speakers and background noise, we find that integrating the LLM is not beneficial due to poor ASR performance. Here, we use ASR with chunked long-form decoding to improve context usage that may be unavailable when transcribing with Voice Activity Detection segmentation alone.

#25 ALADAN at IWSLT24 Low-resource Arabic Dialectal Speech Translation Task

Authors: Waad Ben Kheder ; Josef Jon ; André Beyer ; Abdel Messaoudi ; Rabea Affan ; Claude Barras ; Maxim Tychonov ; Jean-Luc Gauvain

This paper presents ALADAN’s approach to the IWSLT 2024 Dialectal and Low-resource shared task, focusing on Levantine Arabic (apc) and Tunisian Arabic (aeb) to English speech translation (ST). Addressing challenges such as the lack of standardized orthography and limited training data, we propose a solution for data normalization in Dialectal Arabic, employing a modified Levenshtein distance and Word2vec models to find orthographic variants of the same word. Our system consists of a cascade ST system integrating two ASR systems (TDNN-F and Zipformer) and two NMT modules derived from pre-trained models (NLLB-200 1.3B distilled model and CohereAI’s Command-R). Additionally, we explore the integration of unsupervised textual and audio data, highlighting the importance of multi-dialectal datasets for both ASR and NMT tasks. Our system achieves a BLEU score of 31.5 for Levantine Arabic on the official validation set.