INTERSPEECH.2020 - Language and Multimodal

Total: 119

#1 Toward Silent Paralinguistics: Speech-to-EMG — Retrieving Articulatory Muscle Activity from Speech [PDF] [Copy] [Kimi1]

Authors: Catarina Botelho ; Lorenz Diener ; Dennis Küster ; Kevin Scheck ; Shahin Amiriparian ; Björn W. Schuller ; Tanja Schultz ; Alberto Abad ; Isabel Trancoso

Electromyographic (EMG) signals recorded during speech production encode information on articulatory muscle activity and also on the facial expression of emotion, thus representing a speech-related biosignal with strong potential for paralinguistic applications. In this work, we estimate the electrical activity of the muscles responsible for speech articulation directly from the speech signal. To this end, we first perform a neural conversion of speech features into electromyographic time domain features, and then attempt to retrieve the original EMG signal from the time domain features. We propose a feed forward neural network to address the first step of the problem (speech features to EMG features) and a neural network composed of a convolutional block and a bidirectional long short-term memory block to address the second problem (true EMG features to EMG signal). We observe that four out of the five originally proposed time domain features can be estimated reasonably well from the speech signal. Further, the five time domain features are able to predict the original speech-related EMG signal with a concordance correlation coefficient of 0.663. We further compare our results with the ones achieved on the inverse problem of generating acoustic speech features from EMG features.

#2 Multimodal Deception Detection Using Automatically Extracted Acoustic, Visual, and Lexical Features [PDF] [Copy] [Kimi1]

Authors: Jiaxuan Zhang ; Sarah Ita Levitan ; Julia Hirschberg

Deception detection in conversational dialogue has attracted much attention in recent years. Yet existing methods for this rely heavily on human-labeled annotations that are costly and potentially inaccurate. In this work, we present an automated system that utilizes multimodal features for conversational deception detection, without the use of human annotations. We study the predictive power of different modalities and combine them for better performance. We use openSMILE to extract acoustic features after applying noise reduction techniques to the original audio. Facial landmark features are extracted from the visual modality. We experiment with training facial expression detectors and applying Fisher Vectors to encode sequences of facial landmarks with varying length. Linguistic features are extracted from automatic transcriptions of the data. We examine the performance of these methods on the Box of Lies dataset of deception game videos, achieving 73% accuracy using features from all modalities. This result is significantly better than previous results on this corpus which relied on manual annotations, and also better than human performance.

#3 Multi-Modal Attention for Speech Emotion Recognition [PDF] [Copy] [Kimi1]

Authors: Zexu Pan ; Zhaojie Luo ; Jichen Yang ; Haizhou Li

Emotion represents an essential aspect of human speech that is manifested in speech prosody. Speech, visual, and textual cues are complementary in human communication. In this paper, we study a hybrid fusion method, referred to as multi-modal attention network (MMAN) to makes use of visual and textual cues in speech emotion recognition. We propose a novel multi-modal attention mechanism, cLSTM-MMA, which facilitates the attention across three modalities and selectively fuse the information. cLSTM-MMA is fused with other uni-modal sub-networks in the late fusion. The experiments show that speech emotion recognition benefits significantly from visual and textual cues, and the proposed cLSTM-MMA alone is as competitive as other fusion methods in terms of accuracy, but with a much more compact network structure. The proposed hybrid network MMAN achieves state-of-the-art performance on IEMOCAP database for emotion recognition.

#4 WISE: Word-Level Interaction-Based Multimodal Fusion for Speech Emotion Recognition [PDF] [Copy] [Kimi1]

Authors: Guang Shen ; Riwei Lai ; Rui Chen ; Yu Zhang ; Kejia Zhang ; Qilong Han ; Hongtao Song

While having numerous real-world applications, speech emotion recognition is still a technically challenging problem. How to effectively leverage the inherent multiple modalities in speech data (e.g., audio and text) is key to accurate classification. Existing studies normally choose to fuse multimodal features at the utterance level and largely neglect the dynamic interplay of features from different modalities at a fine-granular level over time. In this paper, we explicitly model dynamic interactions between audio and text at the word level via interaction units between two long short-term memory networks representing audio and text. We also devise a hierarchical representation of audio information from the frame, phoneme and word levels, which largely improves the expressiveness of resulting audio features. We finally propose WISE, a novel word-level interaction-based multimodal fusion framework for speech emotion recognition, to accommodate the aforementioned components. We evaluate WISE on the public benchmark IEMOCAP corpus and demonstrate that it outperforms state-of-the-art methods.

#5 A Multi-Scale Fusion Framework for Bimodal Speech Emotion Recognition [PDF] [Copy] [Kimi1]

Authors: Ming Chen ; Xudong Zhao

Speech emotion recognition (SER) is a challenging task that requires to learn suitable features for achieving good performance. The development of deep learning techniques makes it possible to automatically extract features rather than construct hand-crafted features. In this paper, a multi-scale fusion framework named STSER is proposed for bimodal SER by using speech and text information. A smodel, which takes advantage of convolutional neural network (CNN), bi-directional long short-term memory (Bi-LSTM) and the attention mechanism, is proposed to learn speech representation from the log-mel spectrogram extracted from speech data. Specifically, the CNN layers are utilized to learn local correlations. Then the Bi-LSTM layer is applied to learn long-term dependencies and contextual information. Finally, the multi-head self-attention layer makes the model focus on the features that are most related to the emotions. A tmodel using a pre-trained ALBERT model is applied for learning text representation from text data. Finally, a multi-scale fusion strategy, including feature fusion and ensemble learning, is applied to improve the overall performance. Experiments conducted on the public emotion dataset IEMOCAP have shown that the proposed STSER can achieve comparable recognition accuracy with fewer feature inputs.

#6 Group Gated Fusion on Attention-Based Bidirectional Alignment for Multimodal Emotion Recognition [PDF] [Copy] [Kimi1]

Authors: Pengfei Liu ; Kun Li ; Helen Meng

Emotion recognition is a challenging and actively-studied research area that plays a critical role in emotion-aware human-computer interaction systems. In a multimodal setting, temporal alignment between different modalities has not been well investigated yet. This paper presents a new model named as Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based bidirectional alignment network over LSTM hidden states to explicitly capture the alignment relationship between speech and text, and a novel group gated fusion (GGF) layer to integrate the representations of different modalities. We empirically show that the attention-aligned representations outperform the last-hidden-states of LSTM significantly, and the proposed GBAN model outperforms existing state-of-the-art multimodal approaches on the IEMOCAP dataset.

#7 Multi-Modal Embeddings Using Multi-Task Learning for Emotion Recognition [PDF] [Copy] [Kimi1]

Authors: Aparna Khare ; Srinivas Parthasarathy ; Shiva Sundaram

General embeddings like word2vec, GloVe and ELMo have shown a lot of success in natural language tasks. The embeddings are typically extracted from models that are built on general tasks such as skip-gram models and natural language generation. In this paper, we extend the work from natural language understanding to multi-modal architectures that use audio, visual and textual information for machine learning tasks. The embeddings in our network are extracted using the encoder of a transformer model trained using multi-task training. We use person identification and automatic speech recognition as the tasks in our embedding generation framework. We tune and evaluate the embeddings on the downstream task of emotion recognition and demonstrate that on the CMU-MOSEI dataset, the embeddings can be used to improve over previous state of the art results.

#8 Using Speaker-Aligned Graph Memory Block in Multimodally Attentive Emotion Recognition Network [PDF] [Copy] [Kimi1]

Authors: Jeng-Lin Li ; Chi-Chun Lee

Integrating multimodal emotion sensing modules in realizing human-centered technologies is rapidly growing. Despite recent advancement of deep architectures in improving recognition performances, inability to handle individual differences in the expressive cues creates a major hurdle for real world applications. In this work, we propose a Speaker-aligned Graph Memory Network (SaGMN) that leverages the use of speaker embedding learned from a large speaker verification network to characterize such an individualized personal difference across speakers. Specifically, the learning of the gated memory block is jointly optimized with a speaker graph encoder which aligns similar vocal characteristics samples together while effectively enlarge the discrimination across emotion classes. We evaluate our multimodal emotion recognition network on the CMU-MOSEI database and achieve a state-of-art accuracy of 65.1% UAR and 74.7% F1 score. Further visualization experiments demonstrate the effect of speaker space alignment with the use of graph memory blocks.

#9 Context-Dependent Domain Adversarial Neural Network for Multimodal Emotion Recognition [PDF] [Copy] [Kimi1]

Authors: Zheng Lian ; Jianhua Tao ; Bin Liu ; Jian Huang ; Zhanlei Yang ; Rongjun Li

Emotion recognition remains a complex task due to speaker variations and low-resource training samples. To address these difficulties, we focus on the domain adversarial neural networks (DANN) for emotion recognition. The primary task is to predict emotion labels. The secondary task is to learn a common representation where speaker identities can not be distinguished. By using this approach, we bring the representations of different speakers closer. Meanwhile, through using the unlabeled data in the training process, we alleviate the impact of low-resource training samples. In the meantime, prior work found that contextual information and multimodal features are important for emotion recognition. However, previous DANN based approaches ignore these information, thus limiting their performance. In this paper, we propose the context-dependent domain adversarial neural network for multimodal emotion recognition. To verify the effectiveness of our proposed method, we conduct experiments on the benchmark dataset IEMOCAP. Experimental results demonstrate that the proposed method shows an absolute improvement of 3.48% over state-of-the-art strategies.

#10 ATCSpeech: A Multilingual Pilot-Controller Speech Corpus from Real Air Traffic Control Environment [PDF] [Copy] [Kimi1]

Authors: Bo Yang ; Xianlong Tan ; Zhengmao Chen ; Bing Wang ; Min Ruan ; Dan Li ; Zhongping Yang ; Xiping Wu ; Yi Lin

Automatic Speech Recognition (ASR) technique has been greatly developed in recent years, which expedites many applications in other fields. For the ASR research, speech corpus is always an essential foundation, especially for the vertical industry, such as Air Traffic Control (ATC). There are some speech corpora for common applications, public or paid. However, for the ATC domain, it is difficult to collect raw speeches from real systems due to safety issues. More importantly, annotating the transcription is a more laborious work for the supervised learning ASR task, which hugely restricts the prospect of ASR application. In this paper, a multilingual speech corpus (ATCSpeech) from real ATC systems, including accented Mandarin Chinese and English speeches, is built and released to encourage the non-commercial ASR research in the ATC domain. The corpus is detailly introduced from the perspective of data amount, speaker gender and role, speech quality and other attributions. In addition, the performance of baseline ASR models is also reported. A community edition for our speech database can be applied and used under a special contract. To our best knowledge, this is the first work that aims at building a real and multilingual ASR corpus for the ATC related research.

#11 Developing an Open-Source Corpus of Yoruba Speech [PDF] [Copy] [Kimi1]

Authors: Alexander Gutkin ; Işın Demirşahin ; Oddur Kjartansson ; Clara Rivera ; Kọ́lá Túbọ̀sún

This paper introduces an open-source speech dataset for Yoruba — one of the largest low-resource West African languages spoken by at least 22 million people. Yoruba is one of the official languages of Nigeria, Benin and Togo, and is spoken in other neighboring African countries and beyond. The corpus consists of over four hours of 48 kHz recordings from 36 male and female volunteers and the corresponding transcriptions that include disfluency annotation. The transcriptions have full diacritization, which is vital for pronunciation and lexical disambiguation. The annotated speech dataset described in this paper is primarily intended for use in text-to-speech systems, serve as adaptation data in automatic speech recognition and speech-to-speech translation, and provide insights in West African corpus linguistics. We demonstrate the use of this corpus in a simple statistical parametric speech synthesis (SPSS) scenario evaluating it against the related languages from the CMU Wilderness dataset and the Yoruba Lagos-NWU corpus.

#12 ClovaCall: Korean Goal-Oriented Dialog Speech Corpus for Automatic Speech Recognition of Contact Centers [PDF] [Copy] [Kimi1]

Authors: Jung-Woo Ha ; Kihyun Nam ; Jingu Kang ; Sang-Woo Lee ; Sohee Yang ; Hyunhoon Jung ; Hyeji Kim ; Eunmi Kim ; Soojin Kim ; Hyun Ah Kim ; Kyoungtae Doh ; Chan Kyu Lee ; Nako Sung ; Sunghun Kim

Automatic speech recognition (ASR) via call is essential for various applications, including AI for contact center (AICC) services. Despite the advancement of ASR, however, most publicly available call-based speech corpora such as Switchboard are old-fashioned. Also, most existing call corpora are in English and mainly focus on open domain dialog or general scenarios such as audiobooks. Here we introduce a new large-scale Korean call-based speech corpus under a goal-oriented dialog scenario from more than 11,000 people, i.e., ClovaCall corpus. ClovaCall includes approximately 60,000 pairs of a short sentence and its corresponding spoken utterance in a restaurant reservation domain. We validate the effectiveness of our dataset with intensive experiments using two standard ASR models. Furthermore, we release our ClovaCall dataset and baseline source codes to be available via

#13 LAIX Corpus of Chinese Learner English: Towards a Benchmark for L2 English ASR [PDF] [Copy] [Kimi1]

Authors: Yanhong Wang ; Huan Luan ; Jiahong Yuan ; Bin Wang ; Hui Lin

This paper introduces a corpus of Chinese Learner English containing 82 hours of L2 English speech by Chinese learners from all major dialect regions, collected through mobile apps developed by LAIX Inc. The LAIX corpus was created to serve as a benchmark dataset for evaluating Automatic Speech Recognition (ASR) performance on L2 English, the first of this kind as far as we know. The paper describes our effort to build the corpus, including corpus design, data selection and transcription. Multiple rounds of quality check were conducted in the transcription process. Transcription errors were analyzed in terms of error types, rounds of reviewing, and learners’ proficiency levels. Word error rates of state-of-the-art ASR systems on the benchmark corpus were also reported.

#14 Design and Development of a Human-Machine Dialog Corpus for the Automated Assessment of Conversational English Proficiency [PDF] [Copy] [Kimi1]

Author: Vikram Ramanarayanan

This paper presents a carefully designed corpus of scored spoken conversations between English language learners and a dialog system to facilitate research and development of both human and machine scoring of dialog interactions. We collected speech, demographic and user experience data from non-native speakers of English who interacted with a virtual boss as part of a workplace pragmatics skill building application. Expert raters then scored the dialogs on a custom rubric encompassing 12 aspects of conversational proficiency as well as an overall holistic performance score. We analyze key corpus statistics and discuss the advantages of such a corpus for both human and machine scoring.

#15 CUCHILD: A Large-Scale Cantonese Corpus of Child Speech for Phonology and Articulation Assessment [PDF] [Copy] [Kimi1]

Authors: Si-Ioi Ng ; Cymie Wing-Yee Ng ; Jiarui Wang ; Tan Lee ; Kathy Yuet-Sheung Lee ; Michael Chi-Fai Tong

This paper describes the design and development of CUCHILD, a large-scale Cantonese corpus of child speech. The corpus contains spoken words collected from 1,986 child speakers aged from 3 to 6 years old. The speech materials include 130 words of 1 to 4 syllables in length. The speakers cover both typically developing (TD) children and children with speech disorder. The intended use of the corpus is to support scientific and clinical research, as well as technology development related to child speech assessment. The design of the corpus, including selection of words, participants recruitment, data acquisition process, and data pre-processing are described in detail. The results of acoustical analysis are presented to illustrate the properties of child speech. Potential applications of the corpus in automatic speech recognition, phonological error detection and speaker diarization are also discussed.

#16 FinChat: Corpus and Evaluation Setup for Finnish Chat Conversations on Everyday Topics [PDF] [Copy] [Kimi1]

Authors: Katri Leino ; Juho Leinonen ; Mittul Singh ; Sami Virpioja ; Mikko Kurimo

Creating open-domain chatbots requires large amounts of conversational data and related benchmark tasks to evaluate them. Standardized evaluation tasks are crucial for creating automatic evaluation metrics for model development; otherwise, comparing the models would require resource-expensive human evaluation. While chatbot challenges have recently managed to provide a plethora of such resources for English, resources in other languages are not yet available. In this work, we provide a starting point for Finnish open-domain chatbot research. We describe our collection efforts to create the Finnish chat conversation corpus FinChat, which is made available publicly. FinChat includes unscripted conversations on seven topics from people of different ages. Using this corpus, we also construct a retrieval-based evaluation task for Finnish chatbot development. We observe that off-the-shelf chatbot models trained on conversational corpora do not perform better than chance at choosing the right answer based on automatic metrics, while humans can do the same task almost perfectly. Similarly, in a human evaluation, responses to questions from the evaluation set generated by the chatbots are predominantly marked as incoherent. Thus, FinChat provides a challenging evaluation set, meant to encourage chatbot development in Finnish.

#17 DiPCo — Dinner Party Corpus [PDF] [Copy] [Kimi1]

Authors: Maarten Van Segbroeck ; Ahmed Zaid ; Ksenia Kutsenko ; Cirenia Huerta ; Tinh Nguyen ; Xuewen Luo ; Björn Hoffmeister ; Jan Trmal ; Maurizio Omologo ; Roland Maas

We present a speech data corpus that simulates a “dinner party” scenario taking place in an everyday home environment. The corpus was created by recording multiple groups of four Amazon employee volunteers having a natural conversation in English around a dining table. The participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices positioned at different locations in the recording room. The dataset contains the audio recordings and human labeled transcripts of a total of 10 sessions with a duration between 15 and 45 minutes. The corpus was created to advance in the field of noise robust and distant speech processing and is intended to serve as a public research and benchmarking data set.

#18 Learning to Detect Bipolar Disorder and Borderline Personality Disorder with Language and Speech in Non-Clinical Interviews [PDF] [Copy] [Kimi1]

Authors: Bo Wang ; Yue Wu ; Niall Taylor ; Terry Lyons ; Maria Liakata ; Alejo J. Nevado-Holgado ; Kate E.A. Saunders

Bipolar disorder (BD) and borderline personality disorder (BPD) are both chronic psychiatric disorders. However, their overlapping symptoms and common comorbidity make it challenging for the clinicians to distinguish the two conditions on the basis of a clinical interview. In this work, we first present a new multi-modal dataset containing interviews involving individuals with BD or BPD being interviewed about a non-clinical topic . We investigate the automatic detection of the two conditions, and demonstrate a good linear classifier that can be learnt using a down-selected set of features from the different aspects of the interviews and a novel approach of summarising these features. Finally, we find that different sets of features characterise BD and BPD, thus providing insights into the difference between the automatic screening of the two conditions.

#19 FT Speech: Danish Parliament Speech Corpus [PDF] [Copy] [Kimi1]

Authors: Andreas Kirkedal ; Marija Stepanović ; Barbara Plank

This paper introduces FT Speech, a new speech corpus created from the recorded meetings of the Danish Parliament, otherwise known as the Folketing (FT). The corpus contains over 1,800 hours of transcribed speech by a total of 434 speakers. It is significantly larger in duration, vocabulary, and amount of spontaneous speech than the existing public speech corpora for Danish, which are largely limited to read-aloud and dictation data. We outline design considerations, including the preprocessing methods and the alignment procedure. To evaluate the quality of the corpus, we train automatic speech recognition systems (ASR) on the new resource and compare them to the systems trained on the Danish part of Språkbanken, the largest public ASR corpus for Danish to date. Our baseline results show that we achieve a 14.01 WER on the new corpus. A combination of FT Speech with in-domain language data provides comparable results to models trained specifically on Språkbanken, showing that FT Speech transfers well to this data set. Interestingly, our results demonstrate that the opposite is not the case. This shows that FT Speech provides a valuable resource for promoting research on Danish ASR with more spontaneous speech.

#20 Metric Learning Loss Functions to Reduce Domain Mismatch in the x-Vector Space for Language Recognition [PDF] [Copy] [Kimi1]

Authors: Raphaël Duroselle ; Denis Jouvet ; Irina Illina

State-of-the-art language recognition systems are based on discriminative embeddings called x-vectors. Channel and gender distortions produce mismatch in such x-vector space where embeddings corresponding to the same language are not grouped in an unique cluster. To control this mismatch, we propose to train the x-vector DNN with metric learning objective functions. Combining a classification loss with the metric learning n-pair loss allows to improve the language recognition performance. Such a system achieves a robustness comparable to a system trained with a domain adaptation loss function but without using the domain information. We also analyze the mismatch due to channel and gender, in comparison to language proximity, in the x-vector space. This is achieved using the Maximum Mean Discrepancy divergence measure between groups of x-vectors. Our analysis shows that using the metric learning loss function reduces gender and channel mismatch in the x-vector space, even for languages only observed on one channel in the train set.

#21 The XMUSPEECH System for the AP19-OLR Challenge [PDF] [Copy] [Kimi1]

Authors: Zheng Li ; Miao Zhao ; Jing Li ; Yiming Zhi ; Lin Li ; Qingyang Hong

In this paper, we present our XMUSPEECH system for the oriental language recognition (OLR) challenge, AP19-OLR. The challenge this year contained three tasks: (1) short-utterance LID, (2) cross-channel LID, and (3) zero-resource LID. We leveraged the system pipeline from three aspects, including front-end training, back-end processing, and fusion strategy. We implemented many encoder networks for Tasks 1 and 3, such as extended x-vector, multi-task learning x-vector with phonetic information, and our previously presented multi-feature integration structure. Furthermore, our previously proposed length expansion method was used in the test set for Task 1. I-vector systems based on different acoustic features were built for the cross-channel task. For all of three tasks, the same back-end procedure was used for the sake of stability but with different settings for three tasks. Finally, the greedy fusion strategy helped to choose the subsystems to compose the final fusion systems (submitted systems). Cavg values of 0.0263, 0.2813, and 0.1697 from the development set for Task 1, 2, and 3 were obtained from our submitted systems, and we achieved rank 3rd, 3rd, and 1st in the three tasks in this challenge, respectively.

#22 On the Usage of Multi-Feature Integration for Speaker Verification and Language Identification [PDF] [Copy] [Kimi1]

Authors: Zheng Li ; Miao Zhao ; Jing Li ; Lin Li ; Qingyang Hong

In this paper, we study the technology of multiple acoustic feature integration for the applications of Automatic Speaker Verification (ASV) and Language Identification (LID). In contrast to score level fusion, a common method for integrating subsystems built upon various acoustic features, we explore a new integration strategy, which integrates multiple acoustic features based on the x-vector framework. The frame level, statistics pooling level, segment level, and embedding level integrations are investigated in this study. Our results indicate that frame level integration of multiple acoustic features achieves the best performance in both speaker and language recognition tasks, and the multi-feature integration strategy can be generalized in both classification tasks. Furthermore, we introduce a time-restricted attention mechanism into the frame level integration structure to further improve the performance of multi-feature integration. The experiments are conducted on VoxCeleb 1 for ASV and AP-OLR-17 for LID, and we achieve 28% and 19% relative improvement in terms of Equal Error Rate (EER) in ASV and LID tasks, respectively.

#23 What Does an End-to-End Dialect Identification Model Learn About Non-Dialectal Information? [PDF] [Copy] [Kimi1]

Authors: Shammur A. Chowdhury ; Ahmed Ali ; Suwon Shon ; James Glass

An end-to-end dialect identification system generates the likelihood of each dialect, given a speech utterance. The performance relies on its capabilities to discriminate the acoustic properties between the different dialects, even though the input signal contains non-dialectal information such as speaker and channel. In this work, we study how non-dialectal information are encoded inside the end-to-end dialect identification model. We design several proxy tasks to understand the model’s ability to represent speech input for differentiating non-dialectal information — such as (a) gender and voice identity of speakers, (b) languages, (c) channel (recording and transmission) quality — and compare with dialectal information (i.e., predicting geographic region of the dialects). By analyzing non-dialectal representations from layers of an end-to-end Arabic dialect identification (ADI) model, we observe that the model retains gender and channel information throughout the network while learning a speaker-invariant representation. Our findings also suggest that the CNN layers of the end-to-end model mirror feature extractors capturing voice-specific information, while the fully-connected layers encode more dialectal information.

#24 Releasing a Toolkit and Comparing the Performance of Language Embeddings Across Various Spoken Language Identification Datasets [PDF] [Copy] [Kimi2]

Authors: Matias Lindgren ; Tommi Jauhiainen ; Mikko Kurimo

In this paper, we propose a software toolkit for easier end-to-end training of deep learning based spoken language identification models across several speech datasets. We apply our toolkit to implement three baseline models, one speaker recognition model, and three x-vector architecture variations, which are trained on three datasets previously used in spoken language identification experiments. All models are trained separately on each dataset (closed task) and on a combination of all datasets (open task), after which we compare if the open task training yields better language embeddings. We begin by training all models end-to-end as discriminative classifiers of spectral features, labeled by language. Then, we extract language embedding vectors from the trained end-to-end models, train separate Gaussian Naive Bayes classifiers on the vectors, and compare which model provides best language embeddings for the backend classifier. Our experiments show that the open task condition leads to improved language identification performance on only one of the datasets. In addition, we discovered that increasing x-vector model robustness with random frequency channel dropout significantly reduces its end-to-end classification performance on the test set, while not affecting back-end classification performance of its embeddings. Finally, we note that two baseline models consistently outperformed all other models.

#25 Learning Intonation Pattern Embeddings for Arabic Dialect Identification [PDF] [Copy] [Kimi2]

Authors: Aitor Arronte Alvarez ; Elsayed Sabry Abdelaal Issa

This article presents a full end-to-end pipeline for Arabic Dialect Identification (ADI) using intonation patterns and acoustic representations. Recent approaches to language and dialect identification use linguistic-aware deep architectures that are able to capture phonetic differences amongst languages and dialects. Specifically, in ADI tasks, different combinations of linguistic features and acoustic representations have been successful with deep learning models. The approach presented in this article uses intonation patterns and hybrid residual and bidirectional LSTM networks to learn acoustic embeddings with no additional linguistic information. Results of the experiments show that intonation patterns for Arabic dialects provide sufficient information to achieve state-of-the-art results on the VarDial 17 ADI dataset, outperforming single-feature systems. The pipeline presented is robust to data sparsity, in contrast to other deep learning approaches that require large quantities of data. We conjecture on the importance of sufficient information as a criterion for optimality in a deep learning ADI task, and more generally, its application to acoustic modeling problems. Small intonation patterns, when sufficient in an information-theoretic sense, allow deep learning architectures to learn more accurate speech representations.