INTERSPEECH.2021 - Language and Multimodal

| Total: 101

#1 The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation & Primates [PDF] [Copy] [Kimi1] [REL]

Authors: Björn W. Schuller, Anton Batliner, Christian Bergler, Cecilia Mascolo, Jing Han, Iulia Lefter, Heysem Kaya, Shahin Amiriparian, Alice Baird, Lukas Stappen, Sandra Ottl, Maurice Gerczuk, Panagiotis Tzirakis, Chloë Brown, Jagmohan Chauhan, Andreas Grammenos, Apinan Hasthanasombat, Dimitris Spathis, Tong Xia, Pietro Cicuta, Leon J.M. Rothkrantz, Joeri A. Zwerts, Jelle Treep, Casper S. Kaandorp

The INTERSPEECH 2021 Computational Paralinguistics Challenge addresses four different problems for the first time in a research competition under well-defined conditions: In the COVID-19 Cough and COVID-19 Speech Sub-Challenges, a binary classification of COVID-19 infection has to be made based on coughing sounds and speech; in the Escalation Sub-Challenge, a three-way assessment of the level of escalation in a dialogue is featured; and in the Primates Sub-Challenge, four primate species vs. background need to be classified. We describe the Sub-Challenges, baseline feature extraction, and classifiers based on the ‘usual’ ComParE and BoAW features, as well as deep unsupervised representation learning using the auDeep toolkit and deep feature extraction from pre-trained CNNs using the Deep Spectrum toolkit; in addition, we add deep end-to-end sequential modelling and, in part, linguistic analysis.
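
All Sub-Challenges are scored with Unweighted Average Recall (UAR), the mean of the per-class recalls, which is the metric reported throughout the entries below. A minimal sketch of the metric in plain NumPy (the toy labels are made up for illustration):

```python
import numpy as np

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls (UAR), the ComParE evaluation metric."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for c in np.unique(y_true):
        mask = y_true == c
        recalls.append(np.mean(y_pred[mask] == c))
    return float(np.mean(recalls))

# Example: a 3-class escalation task with imbalanced classes
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 0, 1, 0, 2]
print(unweighted_average_recall(y_true, y_pred))  # (0.75 + 0.5 + 1.0) / 3 = 0.75
```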


#2 Transfer Learning-Based Cough Representations for Automatic Detection of COVID-19 [PDF] [Copy] [Kimi1] [REL]

Authors: Rubén Solera-Ureña, Catarina Botelho, Francisco Teixeira, Thomas Rolland, Alberto Abad, Isabel Trancoso

In recent months, there has been increasing interest in developing reliable, cost-effective, immediate, and easy-to-use machine-learning-based tools that can help health care operators, institutions, companies, etc. optimize their screening campaigns. Along these lines, several initiatives have emerged aimed at the automatic detection of COVID-19 from speech, breathing and coughs, with inconclusive preliminary results. The ComParE 2021 COVID-19 Cough Sub-challenge provides researchers from all over the world a suitable test-bed for the evaluation and comparison of their work. In this paper, we present the INESC-ID contribution to the ComParE 2021 COVID-19 Cough Sub-challenge. We leverage transfer learning to develop a set of three expert classifiers based on deep cough representation extractors. A calibrated decision-level fusion system provides the final classification of cough recordings as either COVID-19 positive or negative. Results show unweighted average recalls of 72.3% and 69.3% on the development and test sets, respectively. Overall, the experimental assessment shows the potential of this approach, although much more research on extended respiratory sound datasets is needed.


#3 The Phonetic Footprint of Covid-19? [PDF] [Copy] [Kimi1] [REL]

Authors: P. Klumpp, T. Bocklet, T. Arias-Vergara, J.C. Vásquez-Correa, P.A. Pérez-Toro, S.P. Bayerl, J.R. Orozco-Arroyave, Elmar Nöth

Against the background of the ongoing pandemic, this year’s Computational Paralinguistics Challenge featured a classification problem to detect Covid-19 from speech recordings. The presented approach is based on a phonetic analysis of speech samples; it thus enabled us not only to discriminate between Covid and non-Covid samples, but also to better understand how the condition influenced an individual’s speech signal. Our deep acoustic model was trained with datasets collected exclusively from healthy speakers. It served as a tool for segmentation and feature extraction on the samples from the challenge dataset. Distinct patterns were found in the embeddings of phonetic classes that have their place of articulation deep inside the vocal tract. We observed profound differences in classification results for the development and test splits, similar to the baseline method. We concluded that, based on our phonetic findings, it was safe to assume that our classifier was able to reliably detect a pathological condition located in the respiratory tract. However, we found no evidence to claim that the system was able to discriminate between Covid-19 and other respiratory diseases.


#4 Transfer Learning and Data Augmentation Techniques to the COVID-19 Identification Tasks in ComParE 2021 [PDF] [Copy] [Kimi1] [REL]

Authors: Edresson Casanova, Arnaldo Candido Jr., Ricardo Corso Fernandes Jr., Marcelo Finger, Lucas Rafael Stefanel Gris, Moacir Antonelli Ponti, Daniel Peixoto Pinto da Silva

In this work, we propose several techniques to address data scarcity in the ComParE 2021 COVID-19 identification tasks for the application of deep models such as Convolutional Neural Networks. The data is initially preprocessed into spectrogram or MFCC-gram formats. After preprocessing, we combine three different data augmentation techniques to be applied during model training. We then employ transfer learning techniques from pretrained audio neural networks. These techniques are applied to several distinct neural architectures. For COVID-19 identification in speech segments, we obtained competitive results. On the other hand, in the identification task based on cough data, we succeeded in producing a noticeable improvement over the existing baselines, reaching 75.9% unweighted average recall (UAR).
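
As a rough illustration of the spectrogram / MFCC-gram preprocessing mentioned above, the sketch below uses librosa; the sampling rate, number of mel bands, number of MFCC coefficients and the file name are illustrative assumptions, not the paper's settings.

```python
import librosa

def preprocess(path, sr=16000, n_mels=64, n_mfcc=40):
    """Load an audio file and return a log-mel spectrogram and an MFCC-gram."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                        # shape: (n_mels, frames)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # shape: (n_mfcc, frames)
    return log_mel, mfcc

# log_mel, mfcc = preprocess("cough_0001.wav")  # hypothetical file name
```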


#5 Visual Transformers for Primates Classification and Covid Detection [PDF] [Copy] [Kimi1] [REL]

Authors: Steffen Illium, Robert Müller, Andreas Sedlmeier, Claudia Linnhoff-Popien

We apply the vision transformer, a deep machine learning model built around the attention mechanism, to mel-spectrogram representations of raw audio recordings. When adding mel-based data augmentation techniques and sample weighting, we achieve comparable performance on both ComParE21 tasks (the PRS and CCS challenges), outperforming most single-model baselines. We further introduce overlapping vertical patching and evaluate the influence of parameter configurations.
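
A small NumPy sketch of what overlapping vertical patching can look like: full-height (all mel bands) patches are cut along the time axis with a stride smaller than the patch width. The patch width and stride below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def vertical_patches(mel, width=16, stride=8):
    """Cut a mel-spectrogram (freq x time) into overlapping full-height patches."""
    patches = []
    for start in range(0, mel.shape[1] - width + 1, stride):
        patches.append(mel[:, start:start + width])
    return np.stack(patches)  # shape: (num_patches, n_mels, width)

mel = np.random.randn(64, 128)        # dummy mel-spectrogram: 64 bands, 128 frames
print(vertical_patches(mel).shape)    # (15, 64, 16) with width=16, stride=8
```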


#6 Deep-Learning-Based Central African Primate Species Classification with MixUp and SpecAugment [PDF] [Copy] [Kimi1] [REL]

Author: Thomas Pellegrini

In this paper, we report experiments in which we aim to automatically classify primate vocalizations according to four primate species of interest, plus a background category with forest sound events. We compare several standard deep neural network architectures: standard deep convolutional neural networks (CNNs), MobileNets and ResNets. To tackle the small size of the training dataset, less than seven thousand audio files, the data augmentation techniques SpecAugment and MixUp proved to be very useful. To counter the highly unbalanced classes of the dataset, we used a balanced data sampler, which proved effective. An exponential moving average of the model weights yielded slight further gains. The best model was a standard 10-layer CNN comprising about five million parameters. It achieved a 93.6% Unweighted Average Recall (UAR) on the development set and generalized well to the test set with a 92.5% UAR, outperforming the official baseline of 86.6%. We quantify the performance gains brought by the augmentations and training tricks, and report fusion and classification experiments based on embeddings that did not bring better results.
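
For reference, a minimal NumPy sketch of the two augmentations named above, applied to log-mel spectrograms; the Beta parameter and mask sizes are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """MixUp: convex combination of two examples and their one-hot labels."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def spec_augment(spec, max_f=8, max_t=20):
    """SpecAugment-style masking: zero out one random frequency band and one time band."""
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    f = np.random.randint(0, max_f + 1)           # frequency mask height
    f0 = np.random.randint(0, n_mels - f + 1)
    spec[f0:f0 + f, :] = 0.0
    t = np.random.randint(0, max_t + 1)           # time mask width
    t0 = np.random.randint(0, n_frames - t + 1)
    spec[:, t0:t0 + t] = 0.0
    return spec
```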


#7 A Deep and Recurrent Architecture for Primate Vocalization Classification [PDF] [Copy] [Kimi1] [REL]

Authors: Robert Müller, Steffen Illium, Claudia Linnhoff-Popien

Wildlife monitoring is an essential part of most conservation efforts, and acoustic monitoring is one of its many building blocks. Acoustic monitoring has the advantage of being non-invasive and applicable in areas of high vegetation. In this work, we present a deep and recurrent architecture for the classification of primate vocalizations that is based upon well-proven modules such as bidirectional Long Short-Term Memory neural networks, pooling, normalized softmax and focal loss. Additionally, we apply Bayesian optimization to obtain a suitable set of hyperparameters. We test our approach on a recently published dataset of primate vocalizations that were recorded in an African wildlife sanctuary. Using an ensemble of the best five models found during hyperparameter optimization on the development set, we achieve an Unweighted Average Recall (UAR) of 89.3% on the test set. Our approach outperforms the best baseline, an ensemble of various deep and shallow classifiers, which achieves a UAR of 87.5%.
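
As a pointer to one of the modules listed above, a short NumPy sketch of the focal loss for a single multi-class example; the focusing parameter gamma is an illustrative assumption.

```python
import numpy as np

def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t), down-weighting easy cases."""
    logits = logits - logits.max()                  # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

print(focal_loss(np.array([2.0, 0.5, -1.0]), target=0))  # easy example -> small loss
print(focal_loss(np.array([0.1, 0.2, 0.0]), target=2))   # hard example -> larger loss
```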


#8 Introducing a Central African Primate Vocalisation Dataset for Automated Species Classification [PDF] [Copy] [Kimi1] [REL]

Authors: Joeri A. Zwerts, Jelle Treep, Casper S. Kaandorp, Floor Meewis, Amparo C. Koot, Heysem Kaya

Automated classification of animal vocalisations is a potentially powerful wildlife monitoring tool. Training robust classifiers requires sizable annotated datasets, which are not easily recorded in the wild. To circumvent this problem, we recorded four primate species under semi-natural conditions in a wildlife sanctuary in Cameroon with the objective of training a classifier capable of detecting species in the wild. Here, we introduce the collected dataset and describe our approach and initial results of classifier development. To increase the efficiency of the annotation process, we condensed the recordings with an energy/change-based automatic vocalisation detection. Splitting the annotated chunks into training, validation and test sets, initial results reveal up to 82% unweighted average recall on the test set in four-class primate species classification.
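
A minimal sketch of an energy-based detector in the spirit of the condensation step described above: frames whose short-term energy rises well above the recording's noise floor are kept for annotation. The frame size and threshold factor are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def detect_active_frames(y, frame_len=1024, hop=512, factor=3.0):
    """Return booleans marking frames whose RMS energy exceeds factor * median RMS."""
    frames = [y[i:i + frame_len] for i in range(0, len(y) - frame_len + 1, hop)]
    rms = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    threshold = factor * np.median(rms)
    return rms > threshold

# Dummy signal: quiet noise with a louder burst in the middle
y = np.random.randn(16000) * 0.01
y[6000:8000] += np.random.randn(2000) * 0.5
print(np.where(detect_active_frames(y))[0])  # frame indices covering the burst
```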


#9 Multi-Attentive Detection of the Spider Monkey Whinny in the (Actual) Wild [PDF] [Copy] [Kimi1] [REL]

Authors: Georgios Rizos, Jenna Lawson, Zhuoda Han, Duncan Butler, James Rosindell, Krystian Mikolajczyk, Cristina Banks-Leite, Björn W. Schuller

We study deep bioacoustic event detection through multi-head attention based pooling, exemplified by wildlife monitoring. In the multiple instance learning framework, a core deep neural network learns a projection of the input acoustic signal into a sequence of embeddings, each representing a segment of the input. Sequence pooling is then required to aggregate the information present in the sequence such that we have a single clip-wise representation. We propose an improvement based on Squeeze-and-Excitation mechanisms upon a recently proposed audio tagging ResNet, and show that it performs significantly better than the baseline, as well as a collection of other recent audio models. We then further enhance our model through an extensive comparative study of recent sequence pooling mechanisms, and achieve our best result using multi-head self-attention followed by concatenation of the head-specific pooled embeddings, which outperforms prediction pooling methods as well as other recent sequence pooling approaches. We perform these experiments on a novel dataset of spider monkey whinny calls that we introduce here, recorded in a rainforest on the south Pacific coast of Costa Rica, with a promising outlook for minimally invasive wildlife monitoring.
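
A compact PyTorch sketch of multi-head attention pooling with concatenation of the head-specific pooled embeddings, in the spirit of the pooling described above; the dimensions and number of heads are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiHeadAttentionPooling(nn.Module):
    """Pool a sequence of segment embeddings into one clip-wise vector per head,
    then concatenate the head-specific pooled embeddings."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.score = nn.Linear(dim, num_heads)  # one attention score per head and segment

    def forward(self, x):                                    # x: (batch, time, dim)
        weights = torch.softmax(self.score(x), dim=1)        # (batch, time, heads)
        pooled = torch.einsum("bth,btd->bhd", weights, x)    # (batch, heads, dim)
        return pooled.flatten(1)                             # (batch, heads * dim)

pool = MultiHeadAttentionPooling(dim=128, num_heads=4)
clip = pool(torch.randn(2, 50, 128))
print(clip.shape)  # torch.Size([2, 512])
```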


#10 Identifying Conflict Escalation and Primates by Using Ensemble X-Vectors and Fisher Vector Features [PDF] [Copy] [Kimi1] [REL]

Authors: José Vicente Egas-López, Mercedes Vetráb, László Tóth, Gábor Gosztolya

Computational paralinguistics is concerned with the automatic identification of non-verbal information in human speech. The Interspeech ComParE challenge features new paralinguistic tasks each year; this time, the tasks include, among others, a cross-corpus conflict escalation task and the identification of primates based solely on audio. In our entry to ComParE 2021, we utilize x-vectors and Fisher vectors as features. To improve the robustness of the predictions, we also experiment with building an ensemble of classifiers from the x-vectors. Lastly, we exploit the fact that the Escalation Sub-Challenge is a conflict detection task and incorporate the SSPNet Conflict Corpus in our training workflow. Using these approaches, at the time of writing, we had already surpassed the official Challenge baselines on both tasks, which demonstrates the efficiency of the employed techniques.


#11 Ensemble-Within-Ensemble Classification for Escalation Prediction from Speech [PDF] [Copy] [Kimi2] [REL]

Authors: Oxana Verkholyak, Denis Dresvyanskiy, Anastasia Dvoynikova, Denis Kotov, Elena Ryumina, Alena Velichko, Danila Mamontov, Wolfgang Minker, Alexey Karpov

Conflict situations arise frequently in our daily lives and often require a timely response to resolve the issues. In order to automatically classify conflict (also referred to as escalation) in speech utterances, we propose ensemble learning, as it improves prediction performance by combining several heterogeneous models that compensate for each other’s weaknesses. However, the effectiveness of a classification ensemble greatly depends on its constituents and their fusion strategy. This paper provides experimental evidence for the effectiveness of different prediction-level fusion strategies and demonstrates the performance of each proposed ensemble on the Escalation Sub-Challenge (ESS) in the framework of the Computational Paralinguistics Challenge (ComParE-2021). The ensembles comprise various machine learning approaches based on acoustic and linguistic characteristics of speech. The training strategy is specifically designed to increase generalization performance on unseen data, while the diverse nature of the ensemble candidates ensures high prediction power and accurate classification.
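
For context, a small NumPy sketch of two common prediction-level fusion strategies of the kind compared in such ensembles, majority voting over hard labels and averaging of class posteriors; the example labels are made up.

```python
import numpy as np

def majority_vote(predictions):
    """predictions: (n_models, n_samples) hard labels -> fused hard labels."""
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    counts = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    return counts.argmax(axis=0)                  # most-voted class per sample

def average_posteriors(posteriors):
    """posteriors: (n_models, n_samples, n_classes) -> fused hard labels."""
    return np.asarray(posteriors).mean(axis=0).argmax(axis=1)

# Three models, four utterances, labels in {0 (low), 1 (mid), 2 (high)} escalation
preds = [[0, 1, 2, 2], [0, 1, 1, 2], [1, 1, 2, 0]]
print(majority_vote(preds))  # [0 1 2 2]
```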


#12 Analysis by Synthesis: Using an Expressive TTS Model as Feature Extractor for Paralinguistic Speech Classification [PDF] [Copy] [Kimi2] [REL]

Authors: Dominik Schiller, Silvan Mertes, Pol van Rijn, Elisabeth André

Modeling adequate features of speech prosody is one key factor in good performance in affective speech classification. However, the distinction between the prosody that is induced by ‘how’ something is said (i.e., affective prosody) and the prosody that is induced by ‘what’ is being said (i.e., linguistic prosody) is neglected in state-of-the-art feature extraction systems. This results in high variability of the calculated feature values for different sentences that are spoken with the same affective intent, which might negatively impact classification performance. While this distinction between prosody types is mostly neglected in affective speech recognition, it is explicitly modeled in expressive speech synthesis to create controlled prosodic variation. In this work, we use the expressive Text-To-Speech model Global Style Token Tacotron to extract features for a speech analysis task. We show that the learned prosodic representations outperform state-of-the-art feature extraction systems in the exemplary use case of Escalation Level Classification.


#13 Speaker-Conversation Factorial Designs for Diarization Error Analysis [PDF] [Copy] [Kimi2] [REL]

Authors: Scott Seyfarth, Sundararajan Srinivasan, Katrin Kirchhoff

Speaker diarization accuracy can be affected by both acoustics and conversation characteristics. Determining the cause of diarization errors is difficult because speaker voice acoustics and conversation structure co-vary, and the interactions between acoustics, conversational structure, and diarization accuracy are complex. This paper proposes a methodology that can distinguish independent marginal effects of acoustic and conversation characteristics on diarization accuracy by remixing conversations in a factorial design. As an illustration, this approach is used to investigate gender-related and language-related accuracy differences with three diarization systems: a baseline system using subsegment x-vector clustering, a variant of it with shorter subsegments, and a third system based on a Bayesian hidden Markov model. Our analysis shows large accuracy disparities for the baseline system primarily due to conversational structure, which are partially mitigated in the other two systems. The illustration thus demonstrates how the methodology can be used to identify and guide diarization model improvements.


#14 SmallER: Scaling Neural Entity Resolution for Edge Devices [PDF] [Copy] [Kimi2] [REL]

Authors: Ross McGowan, Jinru Su, Vince DiCocco, Thejaswi Muniyappa, Grant P. Strimel

In this paper we introduce SmallER, a scalable neural entity resolution system capable of running directly on edge devices. SmallER addresses constraints imposed by the on-device setting such as bounded memory consumption for both model and catalog storage, limited compute resources, and related latency challenges introduced by those restrictions. Our model includes distinct modules to learn syntactic and semantic information and is trained to handle multiple domains within one compact architecture. We use compressed tries to reduce the space required to store catalogs and a novel implementation of spatial partitioning trees to strike a balance between reducing runtime latency and preserving recall relative to full catalog search. Our final model consumes only 3MB of memory at inference time with classification accuracy surpassing that of previously established, domain-specific baseline models on live customer utterances. For the largest catalogs we consider (300 or more entries), our proxy metric for runtime latency is reduced by more than 90%.
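
The paper stores catalogs in compressed tries; as a simplified illustration only, the sketch below shows a plain (uncompressed) character trie for prefix-shared catalog storage. A compressed (radix) trie would additionally merge single-child chains to save space.

```python
class TrieNode:
    __slots__ = ("children", "is_entry")

    def __init__(self):
        self.children = {}      # char -> TrieNode
        self.is_entry = False   # True if a catalog entry ends at this node

class CatalogTrie:
    """Plain character trie; shared prefixes of catalog entries are stored once."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, entry):
        node = self.root
        for ch in entry:
            node = node.children.setdefault(ch, TrieNode())
        node.is_entry = True

    def contains(self, entry):
        node = self.root
        for ch in entry:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_entry

trie = CatalogTrie()
for name in ["living room lamp", "living room light", "kitchen light"]:
    trie.insert(name)
print(trie.contains("kitchen light"), trie.contains("kitchen lamp"))  # True False
```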


#15 Disfluency Detection with Unlabeled Data and Small BERT Models [PDF] [Copy] [Kimi2] [REL]

Authors: Johann C. Rocholl, Vicky Zayats, Daniel D. Walker, Noah B. Murad, Aaron Schneider, Daniel J. Liebling

Disfluency detection models now approach high accuracy on English text. However, little exploration has been done in improving the size and inference time of the model. At the same time, Automatic Speech Recognition (ASR) models are moving from server-side inference to local, on-device inference. Supporting models in the transcription pipeline (like disfluency detection) must follow suit. In this work we concentrate on the disfluency detection task, focusing on small, fast, on-device models based on the BERT architecture. We demonstrate it is possible to train disfluency detection models as small as 1.3 MiB, while retaining high performance. We build on previous work that showed the benefit of data augmentation approaches such as self-training. Then, we evaluate the effect of domain mismatch between conversational and written text on model performance. We find that domain adaptation and data augmentation strategies have a more pronounced effect on these smaller models, as compared to conventional BERT models.


#16 Discriminative Self-Training for Punctuation Prediction [PDF] [Copy] [Kimi2] [REL]

Authors: Qian Chen, Wen Wang, Mengzhe Chen, Qinglin Zhang

Punctuation prediction for automatic speech recognition (ASR) output transcripts plays a crucial role in improving the readability of ASR transcripts and the performance of downstream natural language processing applications. However, achieving good performance on punctuation prediction often requires large amounts of labeled speech transcripts, which are expensive and laborious to obtain. In this paper, we propose a Discriminative Self-Training approach with weighted loss and discriminative label smoothing to exploit unlabeled speech transcripts. Experimental results on the English IWSLT2011 benchmark test set and an internal Chinese spoken language dataset demonstrate that the proposed approach achieves significant improvements in punctuation prediction accuracy over strong baselines, including BERT, RoBERTa, and ELECTRA models. The proposed Discriminative Self-Training approach also outperforms the vanilla self-training approach. We establish a new state of the art (SOTA) on the IWSLT2011 test set, outperforming the current SOTA model by 1.3% absolute in F1.
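
The 'discriminative' weighting and smoothing variants are specific to the paper, so the sketch below shows only the generic ingredients of such a self-training recipe: label smoothing and confidence-based pseudo-label selection with per-example weights. The smoothing factor and confidence threshold are illustrative assumptions.

```python
import numpy as np

def smoothed_targets(labels, n_classes, eps=0.1):
    """Standard label smoothing: (1 - eps) on the true class, eps spread over all classes."""
    onehot = np.eye(n_classes)[labels]
    return (1.0 - eps) * onehot + eps / n_classes

def select_pseudo_labels(posteriors, threshold=0.9):
    """Keep unlabeled examples the teacher is confident about; weight them by confidence."""
    posteriors = np.asarray(posteriors)
    confidence = posteriors.max(axis=1)
    keep = confidence >= threshold
    return posteriors[keep].argmax(axis=1), confidence[keep]  # pseudo-labels, loss weights

print(smoothed_targets([0, 2], n_classes=3, eps=0.1))
print(select_pseudo_labels([[0.95, 0.03, 0.02], [0.4, 0.35, 0.25]]))
```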


#17 Zero-Shot Joint Modeling of Multiple Spoken-Text-Style Conversion Tasks Using Switching Tokens [PDF] [Copy] [Kimi2] [REL]

Authors: Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura

In this paper, we propose a novel spoken-text-style conversion method that can simultaneously execute multiple style conversion modules, such as punctuation restoration and disfluency deletion, without preparing matched datasets. In practice, transcriptions generated by automatic speech recognition systems are not highly readable because they often include many disfluencies and lack punctuation marks. To improve their readability, multiple spoken-text-style conversion modules that individually model a single conversion task are cascaded, because matched datasets that simultaneously handle multiple conversion tasks are often unavailable. However, the cascade is sensitive to the order of the tasks because conversion errors accumulate along the chain. Moreover, the computational cost of the cascade is higher than that of a single conversion. To execute multiple conversion tasks simultaneously without preparing matched datasets, our key idea is to distinguish the individual conversion tasks using on/off switches. In our proposed zero-shot joint modeling, we switch the individual tasks using multiple switching tokens, enabling us to utilize a zero-shot learning approach to execute simultaneous conversions. Our experiments on joint modeling of disfluency deletion and punctuation restoration demonstrate the effectiveness of our method.
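
A toy sketch of the switching-token idea: each conversion task gets an on/off token prepended to the source text, so a single model can be asked for any combination of conversions at inference time. The token strings and the example sentence are illustrative assumptions, not the paper's vocabulary.

```python
def build_source(text, punctuate=False, remove_disfluency=False):
    """Prepend one on/off switching token per conversion task to the source text."""
    switches = [
        "<punct:on>" if punctuate else "<punct:off>",
        "<disfl:on>" if remove_disfluency else "<disfl:off>",
    ]
    return " ".join(switches) + " " + text

asr_output = "um so i think uh we should start"
print(build_source(asr_output, punctuate=True, remove_disfluency=False))
# <punct:on> <disfl:off> um so i think uh we should start
```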


#18 A Noise Robust Method for Word-Level Pronunciation Assessment [PDF] [Copy] [Kimi2] [REL]

Authors: Binghuai Lin, Liyuan Wang

The common approach to pronunciation evaluation is based on Goodness of Pronunciation (GOP), which has been found to perform worse under noisy conditions. Traditional methods compensate the pronunciation features to improve assessment performance in noisy conditions. This paper proposes a noise-robust model for word-level pronunciation assessment based on domain adversarial training (DAT). We treat pronunciation assessment in clean and noisy conditions as the source and target domains, respectively. The network is optimized by incorporating both pronunciation assessment and noise-domain discrimination. The domain labels are generated with unsupervised methods to adapt to various noise conditions. We evaluate the model on English words recorded by Chinese learners of English and labeled by three experts. Experimental results show that, on average, the proposed model outperforms the baseline by 3% in Pearson correlation coefficient (PCC) and 4% in accuracy under different noise conditions.
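
Domain adversarial training is typically realised with a gradient reversal layer between the shared encoder and the domain (here, noise-condition) discriminator. A minimal PyTorch sketch of such a layer is given below; the scaling factor and feature size are illustrative assumptions.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Features from a shared encoder go straight to the assessment head,
# but pass through grad_reverse() before the domain (clean vs. noisy) discriminator.
features = torch.randn(8, 256, requires_grad=True)
domain_branch_input = grad_reverse(features, lambd=0.5)
```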


#19 Targeted Keyword Filtering for Accelerated Spoken Topic Identification [PDF] [Copy] [Kimi2] [REL]

Author: Jonathan Wintrode

We present a novel framework for spoken topic identification that simultaneously learns both topic-specific keywords and acoustic keyword filters from only document-level topic labels. At inference time, only audio segments likely to contain topic-salient keywords are fully decoded, reducing the system’s overall computation cost. We show that this filtering allows for effective topic classification while decoding only 50% of ASR output word lattices, and achieves error rates within 1.2% and precision within 2.6% of an unfiltered baseline system.


#20 Multimodal Speech Summarization Through Semantic Concept Learning [PDF] [Copy] [Kimi2] [REL]

Authors: Shruti Palaskar, Ruslan Salakhutdinov, Alan W. Black, Florian Metze

We propose a cascaded multimodal abstractive speech summarization model that generates semantic concepts as an intermediate step towards summarization. We describe a method to leverage existing multimodal dataset annotations to curate ground-truth labels for such intermediate concept modeling. In addition to enabling cascaded training, the concept labels provide an interpretable intermediate output level that helps improve performance on the downstream summarization task. On the open-domain How2 data, we conduct utterance-level and video-level experiments for two granularities of concepts: Specific and Abstract. We compare various multimodal fusion models for concept generation based on the respective input modalities. We observe consistent improvements in concept modeling by using multimodal adaptation models over unimodal models. Using the cascaded multimodal speech summarization model, we see a significant improvement of 7.5 METEOR points and 5.1 ROUGE-L points compared to previous speech summarization methods. Finally, we demonstrate the scalability of the proposed approaches on 2000 h of video data.


#21 Enhancing Semantic Understanding with Self-Supervised Methods for Abstractive Dialogue Summarization [PDF] [Copy] [Kimi2] [REL]

Authors: Hyunjae Lee, Jaewoong Yun, Hyunjin Choi, Seongho Joe, Youngjune L. Gwon

Contextualized word embeddings can lead to state-of-the-art performance in natural language understanding. Recently, pre-trained deep contextualized text encoders such as BERT have shown their potential for improving natural language tasks, including abstractive summarization. Existing approaches to dialogue summarization focus on incorporating a large language model into the summarization task, trained on large-scale corpora of news articles rather than dialogues between multiple speakers. In this paper, we introduce self-supervised methods to compensate for these shortcomings when training a dialogue summarization model. Our principle is to detect incoherent information flows using pretext dialogue text to enhance BERT’s ability to contextualize the dialogue text representations. We build and fine-tune an abstractive dialogue summarization model on a shared encoder-decoder architecture using the enhanced BERT. We empirically evaluate our abstractive dialogue summarizer on the SAMSum corpus, a recently introduced dataset with abstractive dialogue summaries. All of our methods contribute improvements to abstractive summarization as measured by ROUGE scores. Through an extensive ablation study, we also present a sensitivity analysis of critical model hyperparameters: the probabilities of switching utterances and of masking interlocutors.


#22 Speaker Transition Patterns in Three-Party Conversation: Evidence from English, Estonian and Swedish [PDF] [Copy] [Kimi] [REL]

Authors: Marcin Włodarczak, Emer Gilmartin

During conversation, speakers hold and relinquish the floor, resulting in turn yield and retention. We examine these phenomena in three-party conversations in English, Swedish, and Estonian. We define within- and between-speaker transitions in terms of shorter intervals of speech, silence and overlap bounded by stretches of one-party speech longer than 1 second by the same or different speakers. This method gives us insights into how turn change and retention proceed, revealing that the majority of speaker transitions are more complex and involve more intermediate activity than a single silence or overlap. We examine the composition of within- and between-speaker transitions in terms of the number of speakers involved and the incidence and proportion of solo speech, silence and overlap. We derive the most common within- and between-speaker transitions in the three languages, finding evidence of striking commonalities in how the floor is managed. Our findings suggest that current models of turn-taking used in dialogue technology could be extended using these results to more accurately reflect the realities of human-human dialogue.


#23 Data Augmentation for Spoken Language Understanding via Pretrained Language Models [PDF] [Copy] [Kimi] [REL]

Authors: Baolin Peng, Chenguang Zhu, Michael Zeng, Jianfeng Gao

The training of spoken language understanding (SLU) models often faces the problem of data scarcity. In this paper, we put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances. Furthermore, we investigate and propose solutions to two previously overlooked semi-supervised learning scenarios of data scarcity in SLU: i) Rich-in-Ontology: ontology information with numerous valid dialogue acts is given; ii) Rich-in-Utterance: a large number of unlabelled utterances are available. Empirical results show that our method can produce synthetic training data that boosts the performance of language understanding models in various scenarios.


#24 FANS: Fusing ASR and NLU for On-Device SLU [PDF] [Copy] [Kimi2] [REL]

Authors: Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann, Ariya Rastrow

Spoken language understanding (SLU) systems translate voice input commands to semantics, which are encoded as an intent and pairs of slot tags and values. Most current SLU systems deploy a cascade of two neural models, where the first maps the input audio to a transcript (ASR) and the second predicts the intent and slots from the transcript (NLU). In this paper, we introduce FANS, a new end-to-end SLU model that fuses an ASR audio encoder with a multi-task NLU decoder to infer the intent, slot tags, and slot values directly from a given input audio, obviating the need for transcription. FANS consists of a shared audio encoder and three decoders, two of which are seq-to-seq decoders that predict non-null slot tags and slot values in parallel and in an auto-regressive manner. The FANS encoder and decoder architectures are flexible, which allows us to leverage different combinations of LSTMs, self-attention, and attenders. Our experiments show that, compared to state-of-the-art end-to-end SLU models, FANS reduces ICER and IRER by 30% and 7% relative, respectively, when tested on an in-house SLU dataset, and by 0.86% and 2% absolute when tested on a public SLU dataset.


#25 Sequential End-to-End Intent and Slot Label Classification and Localization [PDF] [Copy] [Kimi2] [REL]

Authors: Yiran Cao, Nihal Potdar, Anderson R. Avila

Human-computer interaction (HCI) is significantly impacted by delayed responses from a spoken dialogue system. Hence, end-to-end (e2e) spoken language understanding (SLU) solutions have recently been proposed to decrease latency. Such approaches allow for the extraction of semantic information directly from the speech signal, thus bypassing the need for a transcript from an automatic speech recognition (ASR) system. In this paper, we propose a compact e2e SLU architecture for streaming scenarios, where chunks of the speech signal are processed continuously to predict intent and slot values. Our model is based on a 3D convolutional neural network (3D-CNN) and a unidirectional long short-term memory (LSTM). We compare the performance of two alignment-free losses: the connectionist temporal classification (CTC) method and its adapted version, namely connectionist temporal localization (CTL). The latter performs not only classification but also localization of sequential audio events. The proposed solution is evaluated on the Fluent Speech Commands dataset, and the results show our model’s ability to process the incoming speech signal, reaching accuracies as high as 98.97% for CTC and 98.78% for CTL on single-label classification, and 95.69% for CTC and 95.28% for CTL on two-label prediction.
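
A minimal PyTorch sketch of the CTC loss setup used for this kind of alignment-free training; CTL is the adapted variant described in the paper and is not shown. All shapes, label counts and lengths are illustrative assumptions.

```python
import torch
import torch.nn as nn

T, N, C = 120, 4, 32          # time steps, batch size, number of output labels (incl. blank)
log_probs = torch.randn(T, N, C).log_softmax(dim=2)      # per-frame model outputs
targets = torch.randint(1, C, (N, 6))                    # label sequences (index 0 = blank)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 6, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```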