INTERSPEECH.2022 - Language and Multimodal

| Total: 146

#1 SHAS: Approaching optimal Segmentation for End-to-End Speech Translation [PDF] [Copy] [Kimi1] [REL]

Authors: Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, Marta R. Costa-jussà

Speech translation models are unable to directly process long audios, like TED talks, which have to be split into shorter segments. Speech translation datasets provide manual segmentations of the audios, which are not available in real-world scenarios, and existing segmentation methods usually significantly reduce translation quality at inference time. To bridge the gap between the manual segmentation of training and the automatic one at inference, we propose Supervised Hybrid Audio Segmentation (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented speech corpus. First, we train a classifier to identify the included frames in a segmentation, using speech representations from a pre-trained wav2vec 2.0. The optimal splitting points are then found by a probabilistic Divide-and-Conquer algorithm that progressively splits at the frame of lowest probability until all segments are below a pre-specified length. Experiments on MuST-C and mTEDx show that the translation of the segments produced by our method approaches the quality of the manual segmentation on 5 language pairs. Namely, SHAS retains 95-98% of the manual segmentation's BLEU score, compared to the 87-93% of the best existing methods. Our method is additionally generalizable to different domains and achieves high zero-shot performance in unseen languages.
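
A minimal sketch of the probabilistic Divide-and-Conquer splitting described above, assuming `probs` holds the classifier's per-frame probability of belonging inside a segment; the frame rate and length limit are made up for the example, and this is not the authors' released code:

```python
from typing import List, Tuple
import numpy as np

def split_segment(probs: np.ndarray, start: int, end: int,
                  max_len: int) -> List[Tuple[int, int]]:
    """Recursively split [start, end) at the lowest-probability frame
    until every resulting segment is at most max_len frames long."""
    if end - start <= max_len:
        return [(start, end)]
    # split at the frame least likely to be inside a segment
    split = start + int(np.argmin(probs[start:end]))
    # keep the split strictly inside the interval
    split = min(max(split, start + 1), end - 1)
    return (split_segment(probs, start, split, max_len)
            + split_segment(probs, split, end, max_len))

# Example: 3000 frames at 50 frames/s, keep segments under 20 s (1000 frames)
probs = np.random.rand(3000)
segments = split_segment(probs, 0, len(probs), max_len=1000)
```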


#2 M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation [PDF] [Copy] [Kimi1] [REL]

Authors: Jinming Zhao, Hao Yang, Gholamreza Haffari, Ehsan Shareghi

End-to-end speech-to-text translation models are often initialized with a pre-trained speech encoder and a pre-trained text decoder. This leads to a significant training gap between pre-training and fine-tuning, largely due to the modality differences between speech outputs from the encoder and text inputs to the decoder. In this work, we aim to bridge the modality gap between speech and text to improve translation quality. We propose M-Adapter, a novel Transformer-based module, to adapt speech representations to text. While shrinking the speech sequence, M-Adapter produces features desired for speech-to-text translation via modelling global and local dependencies of a speech sequence. Our experimental results show that our model outperforms a strong baseline by up to 1 BLEU score on the MuST-C En→De dataset. Our code is available at https://github.com/mingzi151/w2v2-st.


#3 Cross-Modal Decision Regularization for Simultaneous Speech Translation [PDF] [Copy] [Kimi1] [REL]

Authors: Mohd Abbas Zaidi, Beomseok Lee, Sangha Kim, Chanwoo Kim

Simultaneous translation systems start producing the output while processing the partial source sentence in the incoming input stream. These systems need to decide when to read more input and when to write the output. The decisions taken by the model depend on the structure of the source/target language and the information contained in the partial input sequence, rather than on the input modality. Hence, the read/write decision policy remains the same across different input modalities, i.e., speech and text. This motivates us to leverage the text transcripts corresponding to the speech input for improving simultaneous speech-to-text translation (SimulST). We propose Cross-Modal Decision Regularization (CMDR) to improve the decision policy of SimulST systems by using the simultaneous text-to-text translation (SimulMT) task. We also extend several techniques from the offline speech translation domain to explore the role of the SimulMT task in improving SimulST performance. Overall, we achieve a 34.66% / 4.5 BLEU improvement over the baseline model across different latency regimes for the MuST-C English-German (EnDe) SimulST task.


#4 Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation [PDF] [Copy] [Kimi1] [REL]

Authors: Ryo Fukuda, Katsuhito Sudoh, Satoshi Nakamura

Speech segmentation, which splits long speech into short segments, is essential for speech translation (ST). Popular VAD tools like WebRTC VAD have generally relied on pause-based segmentation. Unfortunately, pauses in speech do not necessarily match sentence boundaries, and sentences can be connected by a very short pause that is difficult to detect by VAD. In this study, we propose a speech segmentation method using a binary classification model trained using a segmented bilingual speech corpus. We also propose a hybrid method that combines VAD and the above speech segmentation method. Experimental results revealed that the proposed method is more suitable for cascade and end-to-end ST systems than conventional segmentation methods. The hybrid approach further improved the translation performance.


#5 Generalized Keyword Spotting using ASR embeddings [PDF] [Copy] [Kimi] [REL]

Authors: Kirandevraj R, Vinod Kumar Kurmi, Vinay Namboodiri, C V Jawahar

Keyword Spotting (KWS) detects a set of pre-defined spoken keywords. Building a KWS system for an arbitrary set requires massive training datasets. We propose to use the text transcripts from an Automatic Speech Recognition (ASR) system alongside triplets for KWS training. The intermediate representation from the ASR system trained on a speech corpus is used as acoustic word embeddings for keywords. Triplet loss is added to the Connectionist Temporal Classification (CTC) loss in the ASR while training. This method achieves an Average Precision (AP) of 0.843 over 344 words unseen by the model trained on the TIMIT dataset. In contrast, the Multi-View recurrent method that learns jointly on the text and acoustic embeddings achieves only 0.218 for out-of-vocabulary words. This method is also applied to low-resource languages such as Tamil by converting Tamil characters to English using transliteration. This is a very challenging novel task for which we provide a dataset of transcripts for the keywords. Despite our model not generalizing well, we achieve a benchmark AP of 0.321 over 38 words unseen by the model on the MSWC Tamil keyword set. The model also produces an accuracy of 96.2% for classification tasks on the Google Speech Commands dataset.
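
A minimal PyTorch sketch of combining the CTC objective with a triplet loss, as described above; the pooled keyword embeddings and the weighting factor `lam` are illustrative assumptions rather than the paper's exact recipe:

```python
import torch.nn.functional as F

def combined_loss(log_probs, targets, input_lens, target_lens,
                  anchor, positive, negative, margin=1.0, lam=0.5):
    # log_probs: (T, N, C) log-softmax outputs of the ASR encoder, used for CTC
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=0)
    # anchor/positive/negative: (N, D) pooled keyword embeddings taken from an
    # intermediate ASR layer; same-word pairs are positives, other words negatives
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    # joint objective: CTC plus a weighted triplet term (weight is assumed)
    return ctc + lam * triplet
```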


#6 Reducing Offensive Replies in Open Domain Dialogue Systems [PDF] [Copy] [Kimi1] [REL]

Authors: Naokazu Uchida, Takeshi Homma, Makoto Iwayama, Yasuhiro Sogawa

In recent years, a series of open-domain dialogue systems using large-scale language models have been proposed. These dialogue systems are attracting business attention because they can hold remarkably natural and diverse dialogues with humans. However, it has been noted that these dialogue systems reflect gender, race, and other biases inherent in the training data and may generate offensive replies or replies that agree with offensive utterances. This study examined a dialogue system that outputs appropriate replies to offensive utterances. Specifically, our system incorporates multiple dialogue models, each of which is specialized to suppress offensive replies in a specific category, and then selects the most non-offensive reply from the outputs of the models. We evaluated the utility of our system in suppressing offensive replies of DialoGPT. We confirmed that our system reduces offensive replies to less than 1%, whereas one of the state-of-the-art suppression methods only reduces them to 9.8%.


#7 Induce Spoken Dialog Intents via Deep Unsupervised Context Contrastive Clustering [PDF] [Copy] [Kimi1] [REL]

Authors: Ting-Wei Wu, Biing Juang

Intent detection is one of the most critical tasks in spoken language understanding. However, most systems can only identify a predefined set of intents, without covering the ubiquitous space of real-world semantics. Discovering new dialog intents with clustering to explore additional requests is crucial, particularly in complex domains like customer support services. Leveraging the strong coherence between a user query utterance and its following contexts in the dialog, we present an effective intent induction approach with fine-tuning and clustering with contrastive learning. In particular, we first transform pretrained LMs into conversational encoders with in-domain dialogs. Then we conduct context-aware contrastive learning to reveal latent intent semantics via the coherence from dialog contexts. After obtaining the initial representations on both views of the query and its contexts, we propose a novel clustering method to iteratively refine the representations by minimizing semantic distances between pairs of utterances or contexts under the same cluster assignment on the opposite view. The experimental results validate the robustness and versatility of our framework, which also achieves superior performance over competitive baselines without label supervision.


#8 Dialogue Acts Aided Important Utterance Detection Based on Multiparty and Multimodal Information [PDF] [Copy] [Kimi1] [REL]

Authors: Fumio Nihei, Ryo Ishii, Yukiko Nakano, Kyosuke Nishida, Ryo Masumura, Atsushi Fukayama, Takao Nakamura

It has been reported that visualization of important utterances in a meeting enables efficient understanding of the meeting. Therefore, creating a model to estimate important utterances and improving its performance is an important issue. Several studies have reported that introducing auxiliary tasks as estimation targets improves the estimation performance of the main task. In this study, we develop estimation models of important utterances using dialogue acts (DAs) as an auxiliary task. The MATRICS corpus of four-party face-to-face meetings was used as the analysis data. A transformer with historical information was used as the model to estimate important utterances, and three types of modal information (text, audio, and video) were used as input data. In addition, the audio and video data were separated into information about the speaker and the others. As a result, the best model for important utterances was the one that used the speaker's text and audio as well as the others' audio and video data, with the assistance of DAs, achieving an estimation performance of 0.809 in F-measure. The results also showed that this model performed better than the one that only estimates important utterances, indicating that the assistance of DAs is effective in the estimation of important utterances.


#9 Contextual Acoustic Barge-In Classification for Spoken Dialog Systems [PDF] [Copy] [Kimi1] [REL]

Authors: Dhanush Bekal, Sundararajan Srinivasan, Srikanth Ronanki, Sravan Bodapati, Katrin Kirchhoff

In this work, we define barge-in verification as a supervised learning task where audio-only information is used to classify user spoken dialogue into true and false barge-ins. Following the success of pre-trained models, we use low-level speech representations from a self-supervised representation learning model for our downstream classification task. Further, we propose a novel technique to infuse lexical information directly into speech representations to improve the domain-specific language information implicitly learned during pre-training. Experiments conducted on spoken dialog data show that our proposed model trained to validate barge-in entirely from speech representations is faster by 38% relative and achieves a 4.5% relative F1 score improvement over a baseline LSTM model that uses both audio and Automatic Speech Recognition (ASR) 1-best hypotheses. On top of this, our best proposed model with lexically infused representations along with contextual features provides a further relative improvement of 5.7% in the F1 score, but is only 22% faster than the baseline.


#10 Calibrate and Refine! A Novel and Agile Framework for ASR Error Robust Intent Detection [PDF] [Copy] [Kimi1] [REL]

Authors: Peilin Zhou, Dading Chong, Helin Wang, Qingcheng Zeng

The past ten years have witnessed the rapid development of text-based intent detection, whose benchmark performance has already been taken to a remarkable level by deep learning techniques. However, automatic speech recognition (ASR) errors are inevitable in real-world applications due to environmental noise, unique speech patterns, etc., leading to a sharp performance drop in state-of-the-art text-based intent detection models. Essentially, this phenomenon is caused by the semantic drift brought by ASR errors, and most existing works tend to focus on designing new model structures to reduce its impact, which comes at the expense of versatility and flexibility. Different from previous one-piece models, in this paper we propose a novel and agile framework called CR-ID for ASR error robust intent detection with two plug-and-play modules, namely a semantic drift calibration module (SDCM) and a phonemic refinement module (PRM), which are both model-agnostic and can thus be easily integrated into any existing intent detection model without modifying its structure. Experimental results on the SNIPS dataset show that our proposed CR-ID framework achieves competitive performance and outperforms all the baseline methods on ASR outputs, which verifies that CR-ID can effectively alleviate the semantic drift caused by ASR errors.


#11 ASR-Robust Natural Language Understanding on ASR-GLUE dataset [PDF] [Copy] [Kimi1] [REL]

Authors: Lingyun Feng, Jianwei Yu, Yan Wang, Songxiang Liu, Deng Cai, Haitao Zheng

In recent years, with the increasing demand for voice interface applications, more and more attention has been paid to language understanding in speech systems. These speech-based intelligent systems usually comprise an automatic speech recognition (ASR) component and a natural language understanding (NLU) component which takes the output of the ASR component as input. Despite the rapid development of speech recognition over the past few decades, recognition errors are still inevitable, especially in noisy environments. However, the robustness of natural language understanding (NLU) systems to errors introduced by ASR is under-examined. In this paper, we propose three empirical approaches to improve the robustness of NLU models. The first one is ASR correction, which attempts to correct mistranscriptions. The latter two methods focus on simulating a noisy training scenario to train more robust NLU models. Extensive experimental results and analyses show that the proposed methods can effectively improve the robustness of NLU models.


#12 From Disfluency Detection to Intent Detection and Slot Filling [PDF] [Copy] [Kimi1] [REL]

Authors: Mai Hoang Dao, Thinh Truong, Dat Quoc Nguyen

We present the first empirical study investigating the influence of disfluency detection on the downstream tasks of intent detection and slot filling. We perform this study for Vietnamese---a low-resource language that has no previous study as well as no public dataset available for disfluency detection. First, we extend the fluent Vietnamese intent detection and slot filling dataset PhoATIS by manually adding contextual disfluencies and annotating them. Then, we conduct experiments using strong baselines for disfluency detection and joint intent detection and slot filling, which are based on pre-trained language models. We find that: (i) disfluencies have negative effects on the performance of the downstream intent detection and slot filling tasks, and (ii) in the disfluency context, the pre-trained multilingual language model XLM-R yields better intent detection and slot filling performance than the pre-trained monolingual language model PhoBERT, which is the opposite of what is generally found in the fluent context.


#13 Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis [PDF] [Copy] [Kimi1] [REL]

Authors: Hengshun Zhou, Jun Du, Gongzhen Zou, Zhaoxu Nian, Chin-Hui Lee, Sabato Marco Siniscalchi, Shinji Watanabe, Odette Scharenborg, Jingdong Chen, Shifu Xiong, Jian-Qing Gao

In this paper, we describe and publicly release the audio-visual wake word spotting (WWS) database in the MISP2021 Challenge, which covers a range of scenarios of audio and video data collected by near-, mid-, and far-field microphone arrays and cameras, to create a shared and publicly available database for WWS. The database and the code are released, which will be a valuable addition to the community for promoting WWS research using multi-modality information in realistic and complex conditions. Moreover, we investigated different data augmentation methods for the single modalities on an end-to-end WWS network. A set of audio-visual fusion experiments and analyses was conducted to observe the assistance from visual information to acoustic information based on different audio and video field configurations. The results showed that the fusion system generally improves over the single-modality (audio- or video-only) system, especially under complex noisy conditions.


#14 Extending Compositional Attention Networks for Social Reasoning in Videos [PDF] [Copy] [Kimi1] [REL]

Authors: Christina Sartzetaki, Georgios Paraskevopoulos, Alexandros Potamianos

We propose a novel deep architecture for the task of reasoning about social interactions in videos. We leverage the multistep reasoning capabilities of Compositional Attention Networks (MAC) [1], and propose a multimodal extension (MAC-X). MAC-X is based on a recurrent cell that performs iterative mid-level fusion of input modalities (visual, auditory, text) over multiple reasoning steps, by use of a temporal attention mechanism. We then combine MAC-X with LSTMs for temporal input processing in an end-to-end architecture. Our ablation studies show that the proposed MAC-X architecture can effectively leverage multimodal input cues using mid-level fusion mechanisms. We apply MAC-X to the task of Social Video Question Answering in the Social IQ dataset and obtain a 2.5% absolute improvement in terms of binary accuracy over the current state-of-the-art.


#15 TopicKS: Topic-driven Knowledge Selection for Knowledge-grounded Dialogue Generation [PDF] [Copy] [Kimi1] [REL]

Authors: Shiquan Wang, Yuke Si, Xiao Wei, Longbiao Wang, Zhiqiang Zhuang, Xiaowang Zhang, Jianwu Dang

Knowledge-grounded dialogue generation is proposed to solve the problem of general or meaningless responses in traditional end-to-end dialogue generation methods. It generally includes two sub-modules: knowledge selection and knowledge-aware generation. Most studies consider the topic information for knowledge-aware generation, while ignoring it in knowledge selection. It may cause the topic mismatch between the overall dialogue and the selected knowledge, leading to the inconsistency of the generated response and the context. Therefore, in this study, we propose a Topic-driven Knowledge Selection method (TopicKS) to exploit topic information both in knowledge selection and knowledge-aware generation. Specifically, under the guidance of topic information, TopicKS selects more accurate candidate knowledge for the current turn of dialogue based on context information and historical knowledge information. Then the decoder uses the context information and selected knowledge to generate a higher-quality response under the guidance of topic information. Experiments on the notable benchmark corpus Wizard of Wikipedia (WoW) show that our proposed method not only achieves a significant improvement in terms of selection accuracy rate on knowledge selection, but also outperforms the baseline model in terms of the quality of the generated responses.


#16 Bottom-up discovery of structure and variation in response tokens (‘backchannels’) across diverse languages [PDF] [Copy] [Kimi1] [REL]

Authors: Andreas Liesenfeld, Mark Dingemanse

Response tokens (also known as backchannels, continuers, or feedback) are a frequent feature of human interaction, where they serve to display understanding and streamline turn-taking. We propose a bottom-up method to study responsive behaviour across 16 languages (8 language families). We use sequential context and recurrence of turn formats to identify candidate response tokens in a language-agnostic way across diverse conversational corpora. We then use UMAP clustering directly on speech signals to represent structure and variation. We find that (i) written orthographic annotations underrepresent the attested variation, (ii) distinctions between formats can be gradient rather than discrete, (iii) most languages appear to make available a broad distinction between a minimal nasal format ‘mm’ and a fuller ‘yeah’-like format. Charting this aspect of human interaction contributes to our understanding of interactional infrastructure across languages and can inform the design of speech technologies.


#17 Cross-modal Transfer Learning via Multi-grained Alignment for End-to-End Spoken Language Understanding [PDF] [Copy] [Kimi1] [REL]

Authors: Yi Zhu, Zexun Wang, Hang Liu, Peiying Wang, Mingchao Feng, Meng Chen, Xiaodong He

End-to-end spoken language understanding (E2E-SLU) has witnessed impressive improvements through cross-modal (text-to-audio) transfer learning. However, current methods mostly focus on coarse-grained sequence-level text-to-audio knowledge transfer with a simple loss, and neglect the fine-grained temporal alignment between the two modalities. In this work, we propose a novel multi-grained cross-modal transfer learning model for E2E-SLU. Specifically, we devise a cross attention module to align the tokens of text with the frame features of speech, encouraging the model to attend to the salient acoustic features of each token while transferring the semantic information. We also leverage contrastive learning to facilitate cross-modal representation learning at the sentence level. Finally, we explore various data augmentation methods to mitigate the scarcity of labelled data for training E2E-SLU. Extensive experiments are conducted on both English and Chinese SLU datasets to verify the effectiveness of our proposed approach. Experimental results and detailed analyses demonstrate the superiority and competitiveness of our model.
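
Sentence-level cross-modal contrastive learning of this kind is commonly implemented as a symmetric InfoNCE objective over paired speech and text embeddings; the temperature and batch construction below are assumptions for illustration, not details taken from the paper:

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(speech_emb, text_emb, temperature=0.07):
    # speech_emb, text_emb: (N, D) sentence-level embeddings of paired utterances
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.t() / temperature  # (N, N) similarity matrix
    labels = torch.arange(speech_emb.size(0), device=speech_emb.device)
    # paired speech/text on the diagonal are positives; symmetrize both directions
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```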


#18 Use of Nods Less Synchronized with Turn-Taking and Prosody During Conversations in Adults with Autism [PDF] [Copy] [Kimi1] [REL]

Authors: Keiko Ochi, Nobutaka Ono, Keiho Owada, Kuroda Miho, Shigeki Sagayama, Hidenori Yamasue

Autism spectrum disorder (ASD) is a highly prevalent neurodevelopmental disorder characterized by deficits in communication and social interaction. Head-nodding, a kind of visual backchannel, is used to co-construct the conversation and is crucial to smooth social interaction. In the present study, we quantitatively analyze how head-nodding relates to speech turn-taking and prosodic change in Japanese conversation. The results showed that nodding was less frequently observed in ASD participants, especially around speakers' turn transitions, whereas it was notable just before and after turn-taking in individuals with typical development (TD). Analysis using 16-second sliding segments revealed that synchronization between nod frequency and mean vocal intensity was higher in the TD group than in the ASD group. Classification by a support vector machine (SVM) using these proposed features achieved high performance with an accuracy of 91.1% and an F-measure of 0.942. In addition, the results indicated an optimal way of nodding according to turn-ending and emphasis, which could provide standard responses for reference or feedback in social skill training for people with ASD. Furthermore, the natural timing of nodding implied by the results can also be applied to developing interactive responses in humanoid robots or computer graphic (CG) agents.


#19 SF-DST: Few-Shot Self-Feeding Reading Comprehension Dialogue State Tracking with Auxiliary Task [PDF] [Copy] [Kimi1] [REL]

Authors: Jihyun Lee, Gary Geunbae Lee

A few-shot dialogue state tracking (DST) model tracks user requests in dialogue with reliable accuracy even with a small amount of data. In this paper, we introduce an ontology-free few-shot DST with self-feeding belief state input. The self-feeding belief state input increases the accuracy in multi-turn dialogue by summarizing the previous dialogue. We also develop a new slot-gate auxiliary task, which helps classify whether a slot is mentioned in the dialogue. Our model achieved the best score in a few-shot setting for four domains on MultiWOZ 2.0.


#20 Benchmarking Transformers-based models on French Spoken Language Understanding tasks [PDF] [Copy] [Kimi1] [REL]

Authors: Oralie Cattan, Sahar Ghannay, Christophe Servan, Sophie Rosset

In the last five years, the rise of self-attentional Transformer-based architectures has led to state-of-the-art performance on many natural language tasks. Although these approaches are increasingly popular, they require large amounts of data and computational resources. There is still a substantial need for benchmarking methodologies for under-resourced languages in data-scarce application conditions. Most pre-trained language models have been massively studied using the English language and only a few of them have been evaluated on French. In this paper, we propose a unified benchmark, focused on evaluating model quality and ecological impact on two well-known French spoken language understanding tasks. Specifically, we benchmark thirteen well-established Transformer-based models on the two available spoken language understanding tasks for French: MEDIA and ATIS-FR. Within this framework, we show that compact models can reach results comparable to bigger ones while their ecological impact is considerably lower. However, this assumption is nuanced and depends on the considered compression method.


#21 mcBERT: Momentum Contrastive Learning with BERT for Zero-Shot Slot Filling [PDF] [Copy] [Kimi1] [REL]

Authors: Seong-Hwan Heo, WonKee Lee, Jong-Hyeok Lee

Zero-shot slot filling has received considerable attention to cope with the problem of limited available data for the target domain. One of the important factors in zero-shot learning is to make the model learn generalized and reliable representations. For this purpose, we present mcBERT, which stands for momentum contrastive learning with BERT, to develop a robust zero-shot slot filling model. mcBERT uses BERT to initialize the two encoders, the query encoder and key encoder, and is trained by applying momentum contrastive learning. Our experimental results on the SNIPS benchmark show that mcBERT substantially outperforms the previous models, recording a new state-of-the-art. Besides, we also show that each component composing mcBERT contributes to the performance improvement.
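
Momentum contrastive learning typically maintains the key encoder as an exponential moving average of the query encoder, as in MoCo; the sketch below illustrates that update with an assumed momentum coefficient and is not taken from the mcBERT release:

```python
import copy
import torch

def build_encoders(bert_model):
    # both encoders start from the same BERT weights
    query_enc = bert_model
    key_enc = copy.deepcopy(bert_model)
    for p in key_enc.parameters():
        p.requires_grad = False  # key encoder is updated only by momentum
    return query_enc, key_enc

@torch.no_grad()
def momentum_update(query_enc, key_enc, m=0.999):
    # key parameters drift slowly toward the query parameters
    for q_param, k_param in zip(query_enc.parameters(), key_enc.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)
```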


#22 Bottleneck Low-rank Transformers for Low-resource Spoken Language Understanding [PDF] [Copy] [Kimi1] [REL]

Authors: Pu Wang, Hugo Van hamme

End-to-end spoken language understanding (SLU) systems benefit from pretraining on large corpora, followed by fine-tuning on application-specific data. The resulting models are too large for on-edge applications. For instance, BERT-based systems contain over 110M parameters. Observing that the model is over-parameterized, we propose a lean transformer structure where the dimension of the attention mechanism is automatically reduced using group sparsity. We propose a variant where the learned attention subspace is transferred to an attention bottleneck layer. In a low-resource setting and without pre-training, the resulting compact SLU model achieves accuracies competitive with pre-trained large models.
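
A group-sparsity (group-lasso) penalty of the kind mentioned above can be added to the training loss to drive entire attention dimensions toward zero; the grouping by projection column and the penalty weight below are illustrative assumptions, not the paper's exact formulation:

```python
import torch

def group_sparsity_penalty(attn_proj_weight: torch.Tensor, weight: float = 1e-3):
    # attn_proj_weight: (d_model, d_attn); each column is one attention dimension
    col_norms = attn_proj_weight.norm(p=2, dim=0)  # L2 norm per group (column)
    return weight * col_norms.sum()                # sum of group norms (group lasso)
```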


#23 On joint training with interfaces for spoken language understanding [PDF] [Copy] [Kimi1] [REL]

Authors: Anirudh Raju, Milind Rao, Gautam Tiwari, Pranav Dheram, Bryan Anderson, Zhe Zhang, Chul Lee, Bach Bui, Ariya Rastrow

Spoken language understanding (SLU) systems extract both text transcripts and semantics associated with intents and slots from input speech utterances. SLU systems usually consist of (1) an automatic speech recognition (ASR) module, (2) an interface module that exposes relevant outputs from ASR, and (3) a natural language understanding (NLU) module. Interfaces in SLU systems carry information on text transcriptions or richer information like neural embeddings from ASR to NLU. In this paper, we study how interfaces affect joint training for spoken language understanding. Most notably, we obtain state-of-the-art results on the publicly available 50-hr SLURP [1] dataset. We first leverage large-size pretrained ASR and NLU models that are connected by a text interface, and then jointly train both models via a sequence loss function. For scenarios where pretrained models are not utilized, the best results are obtained through joint sequence loss training using richer neural interfaces. Finally, we show the overall diminishing impact of leveraging pretrained models with increased training data size.


#24 Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models [PDF] [Copy] [Kimi1] [REL]

Authors: Vineet Garg, Ognjen Rudovic, Pranay Dighe, Ahmed Hussen Abdelaziz, Erik Marchi, Saurabh Adya, Chandra Dhir, Ahmed Tewfik

We address the problem of detecting speech directed to a device that does not contain a specific wake-word. Specifically, we focus on audio coming from a touch-based invocation. Mitigating virtual assistant (VA) activations due to accidental button presses is critical for user experience. While the majority of approaches to false trigger mitigation (FTM) are designed to detect the presence of a target keyword, inferring user intent in the absence of a keyword is difficult. This also poses a challenge when creating the training/evaluation data for such systems due to inherent ambiguity in the user's data. To this end, we propose a novel FTM approach that uses weakly-labeled training data obtained with a newly introduced data sampling strategy. While this sampling strategy reduces data annotation efforts, the data labels are noisy as the data are not annotated manually. We use these data to train an acoustics-only model for the FTM task by regularizing its loss function via knowledge distillation from an ASR-based (LatticeRNN) model. This improves the model decisions, resulting in a 66% gain in accuracy, as measured by equal-error-rate (EER), over the base acoustics-only model. We also show that the ensemble of the LatticeRNN and acoustic-distilled models brings a further accuracy improvement of 20%.
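
Knowledge-distillation regularization of this kind is commonly implemented by adding a soft-label term to the weakly-supervised loss; the sketch below assumes the LatticeRNN teacher provides per-class probabilities and uses a hypothetical weighting, so it illustrates the general pattern rather than the authors' exact setup:

```python
import torch.nn.functional as F

def distillation_regularized_loss(student_logits, weak_labels, teacher_probs,
                                  alpha=0.5, temperature=2.0):
    # hard (noisy) label term from the weakly-labeled data
    ce = F.cross_entropy(student_logits, weak_labels)
    # soft-label term: match the teacher's class distribution (assumed pre-softmaxed)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_p_student, teacher_probs, reduction="batchmean")
    # alpha balances the two terms; its value here is an assumption
    return (1.0 - alpha) * ce + alpha * kd
```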


#25 Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation [PDF] [Copy] [Kimi1] [REL]

Authors: Ye Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang, Alexis Conneau, Nobu Morioka

End-to-end speech-to-speech translation (S2ST) without relying on intermediate text representations is a rapidly emerging frontier of research. Recent works have demonstrated that the performance of such direct S2ST systems is approaching that of conventional cascade S2ST when trained on comparable datasets. However, in practice, the performance of direct S2ST is bounded by the availability of paired S2ST training data. In this work, we explore multiple approaches for leveraging much more widely available unsupervised and weakly-supervised speech and text data to improve the performance of direct S2ST based on Translatotron 2. With our most effective approaches, the average translation quality of direct S2ST on 21 language pairs on the CVSS-C corpus is improved by +13.6 BLEU (or +113% relatively), as compared to the previous state-of-the-art trained without additional data. The improvements on low-resource languages are even more significant (+398% relatively on average). Our comparative studies suggest future research directions for S2ST and speech representation learning.