IWSLT 2023 - Papers

#1 FINDINGS OF THE IWSLT 2023 EVALUATION CAMPAIGN

Authors: Milind Agarwal ; Sweta Agrawal ; Antonios Anastasopoulos ; Luisa Bentivogli ; Ondřej Bojar ; Claudia Borg ; Marine Carpuat ; Roldano Cattoni ; Mauro Cettolo ; Mingda Chen ; William Chen ; Khalid Choukri ; Alexandra Chronopoulou ; Anna Currey ; Thierry Declerck ; Qianqian Dong ; Kevin Duh ; Yannick Estève ; Marcello Federico ; Souhir Gahbiche ; Barry Haddow ; Benjamin Hsu ; Phu Mon Htut ; Hirofumi Inaguma ; Dávid Javorský ; John Judge ; Yasumasa Kano ; Tom Ko ; Rishu Kumar ; Pengwei Li ; Xutai Ma ; Prashant Mathur ; Evgeny Matusov ; Paul McNamee ; John P. McCrae ; Kenton Murray ; Maria Nadejde ; Satoshi Nakamura ; Matteo Negri ; Ha Nguyen ; Jan Niehues ; Xing Niu ; Atul Kr. Ojha ; John E. Ortega ; Proyag Pal ; Juan Pino ; Lonneke van der Plas ; Peter Polák ; Elijah Rippeth ; Elizabeth Salesky ; Jiatong Shi ; Matthias Sperber ; Sebastian Stüker ; Katsuhito Sudoh ; Yun Tang ; Brian Thompson ; Kevin Tran ; Marco Turchi ; Alex Waibel ; Mingxuan Wang ; Shinji Watanabe ; Rodolfo Zevallos

This paper reports on the shared tasks organized by the 20th IWSLT Conference. The shared tasks address 9 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, multilingual, dialect and low-resource speech translation, and formality control. The shared tasks attracted a total of 38 submissions by 31 teams. The growing interest towards spoken language translation is also witnessed by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly distributed across industry and academia.

#2 Evaluating Multilingual Speech Translation under Realistic Conditions with Resegmentation and Terminology

Authors: Elizabeth Salesky ; Kareem Darwish ; Mohamed Al-Badrashiny ; Mona Diab ; Jan Niehues

We present the ACL 60/60 evaluation sets for multilingual translation of ACL 2022 technical presentations into 10 target languages. This dataset enables further research into multilingual speech translation under realistic recording conditions with unsegmented audio and domain-specific terminology, applying NLP tools to text and speech in the technical domain, and evaluating and improving model robustness to diverse speaker demographics.

#3 The MineTrans Systems for IWSLT 2023 Offline Speech Translation and Speech-to-Speech Translation Tasks

Authors: Yichao Du ; Guo Zhengsheng ; Jinchuan Tian ; Zhirui Zhang ; Xing Wang ; Jianwei Yu ; Zhaopeng Tu ; Tong Xu ; Enhong Chen

This paper presents the MineTrans English-to-Chinese speech translation systems developed for two challenge tracks of IWSLT 2023, i.e., Offline Speech Translation (S2T) and Speech-to-Speech Translation (S2ST). For the S2T track, MineTrans employs a practical cascaded system to explore the limits of translation performance in both constrained and unconstrained settings, where the whole system consists of automatic speech recognition (ASR), punctuation recognition (PC), and machine translation (MT) modules. We also investigate the effectiveness of multiple ASR architectures and explore two MT strategies: supervised in-domain fine-tuning and prompt-guided translation using a large language model. For the S2ST track, we explore a speech-to-unit (S2U) framework to build an end-to-end S2ST system. This system encodes the target speech as discrete units via our trained HuBERT. Then it leverages the standard sequence-to-sequence model to directly learn the mapping between source speech and discrete units without any auxiliary recognition tasks (i.e., ASR and MT tasks). Various efforts are made to improve MineTrans’s performance, such as acoustic model pre-training on large-scale data, data filtering, data augmentation, speech segmentation, knowledge distillation, consistency training, model ensembles, etc.

#4 Improving End-to-End Speech Translation by Imitation-Based Knowledge Distillation with Synthetic Transcripts

Authors: Rebekka Hubert ; Artem Sokolov ; Stefan Riezler

End-to-end automatic speech translation (AST) relies on data that combines audio inputs with text translation outputs. Previous work used existing large parallel corpora of transcriptions and translations in a knowledge distillation (KD) setup to distill a neural machine translation (NMT) model into an AST student model. While KD allows using larger pretrained models, the reliance of previous KD approaches on manual audio transcripts in the data pipeline restricts the applicability of this framework to AST. We present an imitation learning approach where a teacher NMT system corrects the errors of an AST student without relying on manual transcripts. We show that the NMT teacher can recover from errors in automatic transcriptions and is able to correct erroneous translations of the AST student, leading to improvements of about 4 BLEU points over the standard AST end-to-end baseline on the English-German CoVoST-2 and MuST-C datasets, respectively. Code and data are publicly available: https://github.com/HubReb/imitkd_ast/releases/tag/v1.1

#5 The USTC’s Dialect Speech Translation System for IWSLT 2023

Authors: Pan Deng ; Shihao Chen ; Weitai Zhang ; Jie Zhang ; Lirong Dai

This paper presents the USTC system for the IWSLT 2023 Dialectal and Low-resource shared task, which involves translation from Tunisian Arabic to English. We aim to investigate the mutual transfer between Tunisian Arabic and Modern Standard Arabic (MSA) to enhance the performance of speech translation (ST) by following standard pre-training and fine-tuning pipelines. We synthesize a substantial amount of pseudo Tunisian-English paired data using a multi-step pre-training approach. Integrating a Tunisian-MSA translation module into the end-to-end ST model enables the transfer from Tunisian to MSA and facilitates linguistic normalization of the dialect. To increase the robustness of the ST system, we optimize the model’s ability to adapt to ASR errors and propose a model ensemble method. Results indicate that applying the dialect transfer method can increase the BLEU score of dialectal ST. It is shown that the optimal system ensembles both cascaded and end-to-end ST models, achieving BLEU improvements of 2.4 and 2.8 in test1 and test2 sets, respectively, compared to the best published system.

#6 KIT’s Multilingual Speech Translation System for IWSLT 2023

Authors: Danni Liu ; Thai Binh Nguyen ; Sai Koneru ; Enes Yavuz Ugan ; Ngoc-Quan Pham ; Tuan Nam Nguyen ; Tu Anh Dinh ; Carlos Mullov ; Alexander Waibel ; Jan Niehues

Many existing speech translation benchmarks focus on native-English speech in high-quality recording conditions, which often do not match the conditions in real-life use-cases. In this paper, we describe our speech translation system for the multilingual track of IWSLT 2023, which focuses on the translation of scientific conference talks. The test condition features accented input speech and terminology-dense contents. The task requires translation into 10 languages of varying amounts of resources. In the absence of training data from the target domain, we use a retrieval-based approach (kNN-MT) for effective adaptation (+0.8 BLEU for speech translation). We also use adapters to easily integrate incremental training data from data augmentation, and show that it matches the performance of re-training. We observe that cascaded systems are more easily adaptable towards specific target domains, due to their separate modules. Our cascaded speech system outperforms its end-to-end counterpart on scientific talk translation, although their performance remains similar on TED talks.
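The retrieval-based adaptation mentioned in the abstract above (kNN-MT) is commonly formulated as interpolating the translation model's next-token distribution with a distribution built from retrieved nearest neighbours. The sketch below follows that general formulation and is not KIT's implementation; the function names, the temperature, and the interpolation weight `lam` are illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple


def knn_distribution(neighbours: List[Tuple[str, float]],
                     temperature: float = 1.0) -> Dict[str, float]:
    """Turn retrieved (target_token, distance) pairs into a probability distribution.

    Closer neighbours get exponentially more weight; weights of neighbours
    that share the same target token are summed.
    """
    if not neighbours:
        raise ValueError("at least one retrieved neighbour is required")
    weights: Dict[str, float] = {}
    for token, distance in neighbours:
        weights[token] = weights.get(token, 0.0) + math.exp(-distance / temperature)
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}


def interpolate(p_mt: Dict[str, float], p_knn: Dict[str, float],
                lam: float) -> Dict[str, float]:
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_mt."""
    vocab = set(p_mt) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_mt.get(t, 0.0) for t in vocab}


# Hypothetical example: two equally distant neighbours and a toy model distribution.
p_knn = knn_distribution([("a", 0.0), ("b", 0.0)])
mixed = interpolate({"a": 0.8, "c": 0.2}, p_knn, lam=0.5)
```

Because the datastore is built from in-domain examples and consulted only at inference time, this kind of adaptation needs no additional training of the translation model.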

#7 The BIGAI Offline Speech Translation Systems for IWSLT 2023 Evaluation

Author: Zhihang Xie

This paper describes the BIGAI’s submission to IWSLT 2023 Offline Speech Translation task on three language tracks from English to Chinese, German and Japanese. The end-to-end systems are built upon a Wav2Vec2 model for speech recognition and mBART50 models for machine translation. An adapter module is applied to bridge the speech module and the translation module. The CTC loss between speech features and source token sequence is incorporated during training. Experiments show that the systems can generate reasonable translations on three languages. The proposed models achieve BLEU scores of 22.3 for en→de, 10.7 for en→ja and 33.0 for en→zh on tst2023 TED datasets. However, the performance decreases by a significant margin on complex scenarios like presentations and interviews.

#8 Enhancing Video Translation Context with Object Labels

Authors: Jeremy Gwinnup ; Tim Anderson ; Brian Ore ; Eric Hansen ; Kevin Duh

We present a simple yet efficient method to enhance the quality of machine translation models trained on multimodal corpora by augmenting the training text with labels of detected objects in the corresponding video segments. We then test the effects of label augmentation in both baseline and two automatic speech recognition (ASR) conditions. In contrast with multimodal techniques that merge visual and textual features, our modular method is easy to implement and the results are more interpretable. Comparisons are made with Transformer translation architectures trained with baseline and augmented labels, showing improvements of up to +1.0 BLEU on the How2 dataset.

#9 Length-Aware NMT and Adaptive Duration for Automatic Dubbing

Authors: Zhiqiang Rao ; Hengchao Shang ; Jinlong Yang ; Daimeng Wei ; Zongyao Li ; Jiaxin Guo ; Shaojun Li ; Zhengzhe Yu ; Zhanglin Wu ; Yuhao Xie ; Bin Wei ; Jiawei Zheng ; Lizhi Lei ; Hao Yang

This paper presents the submission of Huawei Translation Services Center for the IWSLT 2023 dubbing task in the unconstrained setting. The proposed solution consists of a Transformer-based machine translation model and a phoneme duration predictor. The Transformer is deep and multiple target-to-source length-ratio class labels are used to control target lengths. The variation predictor in FastSpeech2 is utilized to predict phoneme durations. To optimize the isochrony in dubbing, re-ranking and scaling are performed. The source audio duration is used as a reference to re-rank the translations of different length-ratio labels, and the one with minimum time deviation is preferred. Additionally, the phoneme duration outputs are scaled within a defined threshold to narrow the duration gap with the source audio.
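The two isochrony steps described in the abstract above, re-ranking by time deviation and bounded duration scaling, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `Candidate` class, the function names, and the 10% scaling bound (`max_change`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str                       # translation produced with one length-ratio label
    phoneme_durations: List[float]  # predicted duration of each phoneme, in seconds


def pick_candidate(candidates: List[Candidate], source_duration: float) -> Candidate:
    """Prefer the translation whose predicted speech duration is closest to the source."""
    return min(candidates,
               key=lambda c: abs(sum(c.phoneme_durations) - source_duration))


def scale_durations(durations: List[float], source_duration: float,
                    max_change: float = 0.1) -> List[float]:
    """Stretch or compress durations toward the source, by at most max_change."""
    total = sum(durations)
    if total == 0:
        return list(durations)
    factor = source_duration / total
    # Clamp the factor so the speech rate never changes by more than the threshold.
    factor = max(1.0 - max_change, min(1.0 + max_change, factor))
    return [d * factor for d in durations]


# Hypothetical candidates for a 1.25-second source segment.
candidates = [
    Candidate("short", [0.5, 0.5]),
    Candidate("medium", [0.6, 0.6]),
    Candidate("long", [0.9, 0.9]),
]
best = pick_candidate(candidates, source_duration=1.25)
scaled = scale_durations(best.phoneme_durations, source_duration=1.25)
```

Clamping the scale factor keeps the synthesized speech rate close to natural; any remaining gap to the source duration is simply left in place.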

#10 NAVER LABS Europe’s Multilingual Speech Translation Systems for the IWSLT 2023 Low-Resource Track

Authors: Edward Gow-Smith ; Alexandre Berard ; Marcely Zanon Boito ; Ioan Calapodescu

This paper presents NAVER LABS Europe’s systems for Tamasheq-French and Quechua-Spanish speech translation in the IWSLT 2023 Low-Resource track. Our work attempts to maximize translation quality in low-resource settings using multilingual parameter-efficient solutions that leverage strong pre-trained models. Our primary submission for Tamasheq outperforms the previous state of the art by 7.5 BLEU points on the IWSLT 2022 test set, and achieves 23.6 BLEU on this year’s test set, outperforming the second best participant by 7.7 points. For Quechua, we also rank first and achieve 17.7 BLEU, despite having only two hours of translation data. Finally, we show that our proposed multilingual architecture is also competitive for high-resource languages, outperforming the best unconstrained submission to the IWSLT 2021 Multilingual track, despite using much less training data and compute.

#11 Direct Models for Simultaneous Translation and Automatic Subtitling: FBK@IWSLT2023

Authors: Sara Papi ; Marco Gaido ; Matteo Negri

This paper describes the FBK’s participation in the Simultaneous Translation and Automatic Subtitling tracks of the IWSLT 2023 Evaluation Campaign. Our submission focused on the use of direct architectures to perform both tasks: for the simultaneous one, we leveraged the knowledge already acquired by offline-trained models and directly applied a policy to obtain the real-time inference; for the subtitling one, we adapted the direct ST model to produce well-formed subtitles and exploited the same architecture to produce timestamps needed for the subtitle synchronization with audiovisual content. Our English-German SimulST system shows a reduced computational-aware latency compared to the one achieved by the top-ranked systems in the 2021 and 2022 rounds of the task, with gains of up to 3.5 BLEU. Our automatic subtitling system outperforms the only-existing solution based on a direct system by 3.7 and 1.7 SubER in English-German and English-Spanish respectively.

#12 MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation

Authors: Dominik Macháček ; Ondřej Bojar ; Raj Dabre

There have been several meta-evaluation studies on the correlation between human ratings and offline machine translation (MT) evaluation metrics such as BLEU, chrF2, BertScore and COMET. These metrics have been used to evaluate simultaneous speech translation (SST), but their correlations with human ratings of SST, which have recently been collected as Continuous Ratings (CR), are unclear. In this paper, we leverage the evaluations of candidate systems submitted to the English-German SST task at IWSLT 2022 and conduct an extensive correlation analysis of CR and the aforementioned metrics. Our study reveals that the offline metrics are well correlated with CR and can be reliably used for evaluating machine translation in simultaneous mode, with some limitations on the test set size. We conclude that given the current quality levels of SST, these metrics can be used as proxies for CR, alleviating the need for large scale human evaluation. Additionally, we observe that correlations of the metrics with translation as a reference are significantly higher than with simultaneous interpreting, and thus we recommend the former for reliable evaluation.
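A correlation analysis of the kind described in the abstract above compares, item by item, an automatic metric score with the corresponding human rating. The abstract does not say which coefficient the authors used, so the sketch below assumes Pearson's r; the scores are made-up placeholders, not data from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-document scores: an automatic metric vs. Continuous Rating.
metric_scores = [21.0, 25.5, 30.2, 18.4, 27.9]
continuous_ratings = [2.9, 3.4, 4.1, 2.5, 3.8]
r = pearson(metric_scores, continuous_ratings)
```

With few test items the estimate of r is noisy, which is one way to read the abstract's caveat about test set size.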

#13 Improving Neural Machine Translation Formality Control with Domain Adaptation and Reranking-based Transductive Learning

Authors: Zhanglin Wu ; Zongyao Li ; Daimeng Wei ; Hengchao Shang ; Jiaxin Guo ; Xiaoyu Chen ; Zhiqiang Rao ; Zhengzhe Yu ; Jinlong Yang ; Shaojun Li ; Yuhao Xie ; Bin Wei ; Jiawei Zheng ; Ming Zhu ; Lizhi Lei ; Hao Yang ; Yanfei Jiang

This paper presents Huawei Translation Service Center (HW-TSC)’s submission on the IWSLT 2023 formality control task, which provides two training scenarios: supervised and zero-shot, each containing two language pairs, and sets constrained and unconstrained conditions. We train the formality control models for these four language pairs under these two conditions respectively, and submit the corresponding translation results. Our efforts are divided into two fronts: enhancing general translation quality and improving formality control capability. According to the different requirements of the formality control task, we use a multi-stage pre-training method to train a bilingual or multilingual neural machine translation (NMT) model as the basic model, which can improve the general translation quality of the base model to a relatively high level. Then, under the premise of affecting the general translation quality of the basic model as little as possible, we adopt domain adaptation and reranking-based transductive learning methods to improve the formality control capability of the model.

#14 HW-TSC at IWSLT2023: Break the Quality Ceiling of Offline Track via Pre-Training and Domain Adaptation

Authors: Zongyao Li ; Zhanglin Wu ; Zhiqiang Rao ; Xie YuHao ; Guo JiaXin ; Daimeng Wei ; Hengchao Shang ; Wang Minghan ; Xiaoyu Chen ; Zhengzhe Yu ; Li ShaoJun ; Lei LiZhi ; Hao Yang

This paper presents HW-TSC’s submissions to the IWSLT 2023 Offline Speech Translation task, including speech translation of talks from English to German, Chinese, and Japanese, respectively. We participate in all three conditions (constrained training, constrained with large language models training, and unconstrained training) with models of cascaded architectures. We use data enhancement, pre-training models and other means to improve the ASR quality, and R-Drop, deep model, domain data selection, etc. to improve the translation quality. Compared with last year’s best results, we achieve 2.1 BLEU improvement on the MuST-C English-German test set.

#15 Submission of USTC’s System for the IWSLT 2023 - Offline Speech Translation Track

Authors: Xinyuan Zhou ; Jianwei Cui ; Zhongyi Ye ; Yichi Wang ; Luzhen Xu ; Hanyi Zhang ; Weitai Zhang ; Lirong Dai

This paper describes the submissions of the research group USTC-NELSLIP to the 2023 IWSLT Offline Speech Translation competition, which involves translating spoken English into written Chinese. We utilize both cascaded models and end-to-end models for this task. To improve the performance of the cascaded models, we introduce Whisper to reduce errors in the intermediate source language text, achieving a significant improvement in ASR recognition performance. For end-to-end models, we propose Stacked Acoustic-and-Textual Encoding extension (SATE-ex), which feeds the output of the acoustic decoder into the textual decoder for information fusion and to prevent error propagation. Additionally, we improve the performance of the end-to-end system in translating speech by combining the SATE-ex model with the encoder-decoder model through ensembling.

#16 I2R’s End-to-End Speech Translation System for IWSLT 2023 Offline Shared Task

Authors: Muhammad Huzaifah ; Kye Min Tan ; Richeng Duan

This paper describes I2R’s submission to the offline speech translation track for IWSLT 2023. We focus on an end-to-end approach for translation from English audio to German text, one of the three available language directions in this year’s edition. The I2R system leverages pretrained models that have been exposed to large-scale audio and text data for our base model. We introduce several stages of additional pretraining followed by fine-tuning to adapt the system for the downstream speech translation task. The strategy is supplemented by other techniques such as data augmentation, domain tagging, knowledge distillation, and model ensemble, among others. We evaluate the system on several publicly available test sets for comparison.

#17 The NiuTrans End-to-End Speech Translation System for IWSLT23 English-to-Chinese Offline Task

Authors: Yuchen Han ; Xiaoqian Liu ; Hao Chen ; Yuhao Zhang ; Chen Xu ; Tong Xiao ; Jingbo Zhu

This paper describes the NiuTrans end-to-end speech translation system submitted for the IWSLT 2023 English-to-Chinese offline task. Our speech translation models are composed of pre-trained ASR and MT models under the SATE framework. Several pre-trained models with diverse architectures and input representations (e.g., log Mel-filterbank and waveform) were utilized. We proposed an IDA method to iteratively improve the performance of the MT models and generate the pseudo ST data through MT systems. We then trained ST models with different structures and data settings to enhance ensemble performance. Experimental results demonstrate that our NiuTrans system achieved a BLEU score of 29.22 on the MuST-C En-Zh tst-COMMON set, outperforming the previous year’s submission by 0.12 BLEU despite using less MT training data.

#18 ON-TRAC Consortium Systems for the IWSLT 2023 Dialectal and Low-resource Speech Translation Tasks

Authors: Antoine Laurent ; Souhir Gahbiche ; Ha Nguyen ; Haroun Elleuch ; Fethi Bougares ; Antoine Thiol ; Hugo Riguidel ; Salima Mdhaffar ; Gaëlle Laperrière ; Lucas Maison ; Sameer Khurana ; Yannick Estève

This paper describes the ON-TRAC consortium speech translation systems developed for the IWSLT 2023 evaluation campaign. Overall, we participated in three speech translation tracks featured in the low-resource and dialect speech translation shared tasks, namely: i) spoken Tamasheq to written French, ii) spoken Pashto to written French, and iii) spoken Tunisian to written English. All our primary submissions are based on the end-to-end speech-to-text neural architecture using a pretrained SAMU-XLSR model as a speech encoder and an mBART model as a decoder. The SAMU-XLSR model is built from the XLS-R 128 in order to generate language agnostic sentence-level embeddings. This building is driven by the LaBSE model trained on a multilingual text dataset. This architecture allows us to improve the input speech representations and achieve significant improvements compared to conventional end-to-end speech translation systems.

#19 BUT Systems for IWSLT 2023 Marathi - Hindi Low Resource Speech Translation Task

Authors: Santosh Kesiraju ; Karel Beneš ; Maksim Tikhonov ; Jan Černocký

This paper describes the systems submitted for the Marathi-to-Hindi low-resource speech translation task. Our primary submission is based on an end-to-end direct speech translation system, whereas the contrastive one is a cascaded system. The backbone of both the systems is a Hindi-Marathi bilingual ASR system trained on 2790 hours of imperfectly transcribed speech. The end-to-end speech translation system was directly initialized from the ASR, and then fine-tuned for direct speech translation with an auxiliary CTC loss for translation. The MT model for the cascaded system is initialized from a cross-lingual language model, which was then fine-tuned using 1.6 M parallel sentences. All our systems were trained from scratch on publicly available datasets. In the end, we use a language model to re-score the n-best hypotheses. Our primary submission achieved 30.5 and 39.6 BLEU whereas the contrastive system obtained 21.7 and 28.6 BLEU on official dev and test sets respectively. The paper also presents the analysis on several experiments that were conducted and outlines the strategies for improving speech translation in low-resource scenarios.

#20 CMU’s IWSLT 2023 Simultaneous Speech Translation System

Authors: Brian Yan ; Jiatong Shi ; Soumi Maiti ; William Chen ; Xinjian Li ; Yifan Peng ; Siddhant Arora ; Shinji Watanabe

This paper describes CMU’s submission to the IWSLT 2023 simultaneous speech translation shared task for translating English speech to both German text and speech in a streaming fashion. We first build offline speech-to-text (ST) models using the joint CTC/attention framework. These models also use WavLM front-end features and mBART decoder initialization. We adapt our offline ST models for simultaneous speech-to-text translation (SST) by 1) incrementally encoding chunks of input speech, re-computing encoder states for each new chunk and 2) incrementally decoding output text, pruning beam search hypotheses to 1-best after processing each chunk. We then build text-to-speech (TTS) models using the VITS framework and achieve simultaneous speech-to-speech translation (SS2ST) by cascading our SST and TTS models.
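The chunk-wise adaptation described in the abstract above can be pictured as a loop that, for every new audio chunk, re-processes all audio received so far and then commits the single best continuation. The sketch below shows only that control flow under stated assumptions: `decode_step` is a hypothetical stand-in for the real encoder and beam search, and `toy_decode_step` is a dummy used to make the example runnable.

```python
from typing import Callable, Iterable, Iterator, List

# decode_step(audio_so_far, committed_tokens) -> new tokens of the 1-best hypothesis
DecodeStep = Callable[[List[float], List[str]], List[str]]


def simultaneous_decode(chunks: Iterable[List[float]],
                        decode_step: DecodeStep) -> Iterator[List[str]]:
    """Yield the committed output after each incoming audio chunk."""
    audio_so_far: List[float] = []
    committed: List[str] = []
    for chunk in chunks:
        # 1) Incremental encoding: the whole audio prefix is handed over again,
        #    so encoder states can be recomputed for each new chunk.
        audio_so_far.extend(chunk)
        # 2) Incremental decoding: continue from the committed prefix and keep
        #    only the best hypothesis (the beam is pruned to 1-best per chunk).
        committed.extend(decode_step(audio_so_far, committed))
        yield list(committed)


def toy_decode_step(audio_so_far: List[float], committed: List[str]) -> List[str]:
    """Dummy model: emits one placeholder token per two audio samples seen."""
    target_len = len(audio_so_far) // 2
    return [f"tok{i}" for i in range(len(committed), target_len)]


outputs = list(simultaneous_decode([[0.1, 0.2], [0.3, 0.4, 0.5], [0.6]],
                                   toy_decode_step))
```

Committing the 1-best prefix after every chunk means already-emitted text is never revised, which is what allows a downstream TTS stage to start speaking before the input ends.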

#21 Improving Low Resource Speech Translation with Data Augmentation and Ensemble Strategies

Authors: Akshaya Vishnu Kudlu Shanbhogue ; Ran Xue ; Soumya Saha ; Daniel Zhang ; Ashwinkumar Ganesan

This paper describes the speech translation system submitted as part of the IWSLT 2023 shared task on low resource speech translation. The low resource task aids in building models for language pairs where the training corpus is limited. In this paper, we focus on two language pairs, namely, Tamasheq-French (Tmh→Fra) and Marathi-Hindi (Mr→Hi) and implement a speech translation system that is unconstrained. We evaluate three strategies in our system: (a) Data augmentation where we perform different operations on audio as well as text samples, (b) an ensemble model that integrates a set of models trained using a combination of augmentation strategies, and (c) post-processing techniques where we explore the use of large language models (LLMs) to improve the quality of sentences that are generated. Experiments show how data augmentation can relatively improve the BLEU score by 5.2% over the baseline system for Tmh→Fra while an ensemble model further improves performance by 17% for Tmh→Fra and 23% for Mr→Hi task.
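The abstract above states the data augmentation gain as a relative percentage rather than as absolute BLEU points. The small sketch below shows the difference; the function name and the BLEU values are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain over the baseline, in percent."""
    if baseline == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (improved - baseline) / baseline * 100.0


# Hypothetical scores: a 0.52-point absolute gain on a 10.0 BLEU baseline
# corresponds to a relative gain of about 5.2%.
baseline_bleu = 10.0
augmented_bleu = 10.52
absolute_gain = augmented_bleu - baseline_bleu
relative_gain = relative_improvement(baseline_bleu, augmented_bleu)
```

On low-resource tasks, where baseline scores are small, a large relative gain can correspond to a modest absolute change, so it helps to check which of the two is being reported.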

#22 Speech Translation with Style: AppTek’s Submissions to the IWSLT Subtitling and Formality Tracks in 2023

Authors: Parnia Bahar ; Patrick Wilken ; Javier Iranzo-Sánchez ; Mattia Di Gangi ; Evgeny Matusov ; Zoltán Tüske

AppTek participated in the subtitling and formality tracks of the IWSLT 2023 evaluation. This paper describes the details of our subtitling pipeline - speech segmentation, speech recognition, punctuation prediction and inverse text normalization, text machine translation and direct speech-to-text translation, intelligent line segmentation - and how we make use of the provided subtitling-specific data in training and fine-tuning. The evaluation results show that our final submissions are competitive, in particular outperforming the submissions by other participants by 5% absolute as measured by the SubER subtitle quality metric. For the formality track, we participate with our En-Ru and En-Pt production models, which support formality control via prefix tokens. Except for informal Portuguese, we achieve near perfect formality level accuracy while at the same time offering high general translation quality.

#23 QUESPA Submission for the IWSLT 2023 Dialect and Low-resource Speech Translation Tasks

Authors: John E. Ortega ; Rodolfo Zevallos ; William Chen

This article describes the QUESPA team speech translation (ST) submissions for the Quechua to Spanish (QUE–SPA) track featured in the Evaluation Campaign of IWSLT 2023: low-resource and dialect speech translation. Two main submission types were supported in the campaign: constrained and unconstrained. We submitted six total systems of which our best (primary) constrained system consisted of an ST model based on the Fairseq S2T framework where the audio representations were created using log mel-scale filter banks as features and the translations were performed using a transformer. The best (primary) unconstrained system used a pipeline approach which combined automatic speech recognition (ASR) with machine translation (MT). The ASR transcriptions for the best unconstrained system were computed using a pre-trained XLS-R-based model along with a fine-tuned language model. Transcriptions were translated using an MT system based on a fine-tuned, pre-trained language model (PLM). The four other submissions are presented in this article (2 constrained and 2 unconstrained) for comparison because they consist of various architectures. Our results show that direct ST (ASR and MT combined together) can be more effective than a PLM in a low-resource (constrained) setting for Quechua to Spanish. On the other hand, we show that fine-tuning of any type on both the ASR and MT system is worthwhile, resulting in nearly 16 BLEU for the unconstrained task.

#24 GMU Systems for the IWSLT 2023 Dialect and Low-resource Speech Translation Tasks

Authors: Jonathan Mbuya ; Antonios Anastasopoulos

This paper describes the GMU Systems for the IWSLT 2023 Dialect and Low-resource Speech Translation Tasks. We submitted systems for five low-resource tasks and the dialectal task. In this work, we explored self-supervised pre-trained speech models and finetuned them on speech translation downstream tasks. We use the Wav2vec 2.0, XLSR-53, and Hubert as self-supervised models. Unlike Hubert, Wav2vec 2.0 and XLSR-53 achieve the best results when we remove the top three layers. Our results show that Wav2vec 2.0 and Hubert perform similarly with their relative best configuration. In addition, we found that Wav2vec 2.0 pre-trained on audio data of the same language as the source language of a speech translation model achieves better results. For the low-resource setting, the best results are achieved using either the Wav2vec 2.0 or Hubert models, while XLSR-53 achieves the best results for the dialectal transfer task. We find that XLSR-53 does not perform well for low-resource tasks. Using Wav2vec 2.0, we report close to 2 BLEU point improvements on the test set for the Tamasheq-French compared to the baseline system at the IWSLT 2022.

#25 The HW-TSC’s Speech-to-Speech Translation System for IWSLT 2023

Authors: Minghan Wang ; Yinglu Li ; Jiaxin Guo ; Zongyao Li ; Hengchao Shang ; Daimeng Wei ; Min Zhang ; Shimin Tao ; Hao Yang

This paper describes our work on the IWSLT2023 Speech-to-Speech task. Our proposed cascaded system consists of an ensemble of Conformer and S2T-Transformer-based ASR models, a Transformer-based MT model, and a Diffusion-based TTS model. Our primary focus in this competition was to investigate the modeling ability of the Diffusion model for TTS tasks in high-resource scenarios and the role of TTS in the overall S2S task. To this end, we proposed DTS, an end-to-end diffusion-based TTS model that takes raw text as input and generates waveform by iteratively denoising on pure Gaussian noise. Compared to previous TTS models, the speech generated by DTS is more natural and performs better in code-switching scenarios. As the training process is end-to-end, it is relatively straightforward. Our experiments demonstrate that DTS outperforms other TTS models on the GigaS2S benchmark, and also brings positive gains for the entire S2S system.