INTERSPEECH.2021 - Others

Total: 348

#1 Acoustic Indicators of Speech Motor Coordination in Adults With and Without Traumatic Brain Injury

Authors: Tanya Talkar ; Nancy Pearl Solomon ; Douglas S. Brungart ; Stefanie E. Kuchinsky ; Megan M. Eitel ; Sara M. Lippa ; Tracey A. Brickell ; Louis M. French ; Rael T. Lange ; Thomas F. Quatieri

A traumatic brain injury (TBI) can lead to various long-term effects on memory, attention, and mood, as well as headaches and speech and hearing problems. There is a need to better understand the long-term effects of a TBI for objective tracking of an individual’s recovery, which could be used to determine intervention trajectories. This study utilizes acoustic features derived from recordings of speech tasks completed by active-duty service members and veterans (SMVs) enrolled in the Defense and Veterans Brain Injury Center (DVBIC)/Traumatic Brain Injury Center of Excellence (TBICoE) 15-Year Longitudinal TBI Study. We hypothesize that individuals diagnosed with moderate to severe TBI would demonstrate motor speech impairments through decreased coordination of the speech production subsystems as compared to individuals with no history of TBI. Speech motor coordination is measured through correlations of acoustic feature time series representing speech subsystems. Eigenspectra derived from these correlations are utilized in machine learning models to discriminate between the two groups. The fusion of correlation features derived from the recordings achieves an AUC of 0.78. This suggests that residual motor impairments from moderate to severe TBI could be detectable through objective measures of speech motor coordination.
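
As a rough illustration of the correlation-and-eigenspectrum idea described above, the following numpy sketch builds a channel-delay correlation matrix from multichannel acoustic feature time series and feeds its eigenvalues to a simple classifier. The feature channels, delay set, and classifier are illustrative assumptions, not the authors' configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def correlation_eigenspectrum(feats, delays=(0, 1, 3, 7)):
    """feats: (n_channels, n_frames) acoustic feature time series
    (e.g., formant or MFCC tracks). Builds a channel-delay correlation
    matrix and returns its sorted eigenvalues as a feature vector."""
    max_d = max(delays)
    chans = []
    for d in delays:
        # shift each channel by d frames and truncate to a common length
        chans.append(feats[:, d:feats.shape[1] - max_d + d])
    stacked = np.vstack(chans)              # (n_channels * n_delays, T)
    corr = np.corrcoef(stacked)             # channel-delay correlation matrix
    eigvals = np.linalg.eigvalsh(corr)      # symmetric matrix -> real eigenvalues
    return np.sort(eigvals)[::-1]           # descending eigenspectrum

# toy example: 40 recordings, 6 feature channels, 500 frames each
rng = np.random.default_rng(0)
X = np.array([correlation_eigenspectrum(rng.standard_normal((6, 500)))
              for _ in range(40)])
y = np.array([0] * 20 + [1] * 20)           # toy labels: 0 = control, 1 = TBI
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("toy training accuracy:", clf.score(X, y))
```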

#2 On Modeling Glottal Source Information for Phonation Assessment in Parkinson’s Disease

Authors: J.C. Vásquez-Correa ; Julian Fritsch ; J.R. Orozco-Arroyave ; Elmar Nöth ; Mathew Magimai-Doss

Parkinson’s disease produces several motor symptoms, including different speech impairments that are known as hypokinetic dysarthria. Symptoms associated with dysarthria affect different dimensions of speech such as phonation, articulation, prosody, and intelligibility. Studies in the literature have mainly focused on the analysis of articulation and prosody because they seem to be the most prominent symptoms associated with dysarthria severity. However, phonation impairments also play a significant role in evaluating the global speech severity of Parkinson’s patients. This paper proposes an extensive comparison of different methods to automatically evaluate the severity of specific phonation impairments in Parkinson’s patients. The considered models include the computation of perturbation and glottal-based features, in addition to features extracted from zero-frequency filtered signals. We also consider end-to-end models based on 1D CNNs, which are trained to learn features from the raw speech waveform, reconstructed glottal signals, and zero-frequency filtered signals. The results indicate that it is possible to automatically classify speakers with low versus high phonation severity due to the presence of dysarthria and, at the same time, to evaluate the severity of the phonation impairments on a continuous scale, posed as a regression problem.

#3 Distortion of Voiced Obstruents for Differential Diagnosis Between Parkinson’s Disease and Multiple System Atrophy

Authors: Khalid Daoudi ; Biswajit Das ; Solange Milhé de Saint Victor ; Alexandra Foubert-Samier ; Anne Pavy-Le Traon ; Olivier Rascol ; Wassilios G. Meissner ; Virginie Woisard

Parkinson’s disease (PD) and the parkinsonian variant of Multiple System Atrophy (MSA-P) are two neurodegenerative diseases which share similar clinical features, particularly in early disease stages. The differential diagnosis can thus be very challenging. Dysarthria is known to be a frequent and early clinical feature of PD and MSA. It can thus be used as a vehicle to provide a vocal biomarker that could help in the differential diagnosis. In particular, distortion of consonants is known to be a frequent impairment in these diseases. The aim of this study is to investigate distinctive patterns in the distortion of voiced obstruents (plosives and fricatives). This is the first study to examine such distortions in French for the purpose of differential diagnosis between PD and MSA-P (and one of the very few such studies across all languages). We carry out a perceptual and objective analysis of voiced obstruents extracted from the word-initial position of isolated pseudo-words. We first show that devoicing is a significant impairment which predominates in MSA-P. We then show that the voice onset time (VOT) of voiced plosives (prevoicing duration) can be a complementary feature to improve discrimination accuracy between PD and MSA-P.

#4 A Study into Pre-Training Strategies for Spoken Language Understanding on Dysarthric Speech

Authors: Pu Wang ; Bagher BabaAli ; Hugo Van hamme

End-to-end (E2E) spoken language understanding (SLU) systems avoid an intermediate textual representation by mapping speech directly into intents with slot values. This approach requires considerable domain-specific training data. In low-resource scenarios, this is a major concern, e.g., in the present study dealing with SLU for dysarthric speech. Pre-training part of the SLU model on automatic speech recognition targets helps, but no research has shown to what extent SLU on dysarthric speech benefits from knowledge transferred from other dysarthric speech tasks. This paper investigates the efficiency of pre-training strategies for SLU tasks on dysarthric speech. The designed SLU system consists of a TDNN acoustic model for feature encoding and a capsule network for intent and slot decoding. The acoustic model is pre-trained in two stages: initialization with a corpus of normal speech and fine-tuning on a mixture of dysarthric and normal speech. By introducing the intelligibility score as a metric of the impairment severity, this paper quantitatively analyzes the relation between generalization and pathology severity for dysarthric speech.

#5 EasyCall Corpus: A Dysarthric Speech Dataset

Authors: Rosanna Turrisi ; Arianna Braccia ; Marco Emanuele ; Simone Giulietti ; Maura Pugliatti ; Mariachiara Sensi ; Luciano Fadiga ; Leonardo Badino

This paper introduces a new dysarthric speech command dataset in Italian, called the EasyCall corpus. The dataset consists of 21,386 audio recordings from 24 healthy and 31 dysarthric speakers, whose individual degree of speech impairment was assessed by neurologists through the Therapy Outcome Measure. The corpus aims at providing a resource for the development of ASR-based assistive technologies for patients with dysarthria. In particular, it may be exploited to develop a voice-controlled contact application for commercial smartphones, aiming at improving dysarthric patients’ ability to communicate with their family and caregivers. Before recording the dataset, participants were administered a survey to evaluate which commands are more likely to be employed by dysarthric individuals in a voice-controlled contact application. In addition, the dataset includes a list of non-commands (i.e., words near/inside commands or phonetically close to commands) that can be leveraged to build a more robust command recognition system. At present, commercial ASR systems perform poorly on the EasyCall corpus, as we report in this paper. This result corroborates the need for dysarthric speech corpora for developing effective assistive technologies. To the best of our knowledge, this database represents the richest corpus of dysarthric speech to date.

#6 Adaptive Convolutional Neural Network for Text-Independent Speaker Recognition

Authors: Seong-Hu Kim ; Yong-Hwa Park

In text-independent speaker recognition, each utterance is composed of different phonemes depending on the spoken text. Conventional neural networks for speaker recognition are static models, so they do not reflect this phoneme-varying characteristic well. To tackle this limitation, we propose an adaptive convolutional neural network (ACNN) for text-independent speaker recognition. The utterance is divided along the time axis into short segments within which phoneme variation is small. Frame-level features are extracted by applying input-dependent kernels adapted to each segment. Utterance-level embedding extraction and speaker recognition are then performed by applying time average pooling and linear layers. Adaptive VGG-M with 0.356-second segments shows better speaker recognition performance than baseline models, with a Top-1 accuracy of 86.51% and an EER of 5.68%. It extracts more accurate frame-level embeddings for vowel and nasal phonemes than the conventional method, without overfitting or a large number of parameters. This framework for text-independent speaker recognition effectively exploits the phoneme- and text-varying characteristics of speech.
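
A minimal sketch of an input-dependent (adaptive) convolution in PyTorch: a light gating network predicts mixture weights over a small bank of candidate kernels for each segment, and the mixed kernel is applied to that segment. The kernel bank size, gating design, and shapes are assumptions for illustration, not the ACNN's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveConv1d(nn.Module):
    """Input-dependent convolution: a gating network predicts mixture weights
    over K candidate kernels for each input segment, and the mixed kernel is
    applied to that segment."""
    def __init__(self, in_ch, out_ch, kernel_size=5, num_kernels=4):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(num_kernels, out_ch, in_ch, kernel_size) * 0.02)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(in_ch, num_kernels), nn.Softmax(dim=-1))

    def forward(self, x):                       # x: (batch, in_ch, time) = one segment
        alpha = self.gate(x)                    # (batch, K) mixture weights per segment
        outs = []
        for b in range(x.size(0)):              # mix kernels per example, then convolve
            w = torch.einsum('k,koit->oit', alpha[b], self.weight)
            outs.append(F.conv1d(x[b:b + 1], w, padding=self.weight.size(-1) // 2))
        return torch.cat(outs, dim=0)           # (batch, out_ch, time)

# a short segment of 40-dim filterbank features (toy shapes)
segment = torch.randn(8, 40, 36)
print(AdaptiveConv1d(40, 64)(segment).shape)    # torch.Size([8, 64, 36])
```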

#7 Bidirectional Multiscale Feature Aggregation for Speaker Verification

Authors: Jiajun Qi ; Wu Guo ; Bin Gu

In this paper, we propose a novel bidirectional multiscale feature aggregation (BMFA) network with attentional fusion modules for text-independent speaker verification. The feature maps from different stages of the backbone network are iteratively combined and refined in both a bottom-up and top-down manner. Furthermore, instead of simple concatenation or elementwise addition of feature maps from different stages, an attentional fusion module is designed to compute the fusion weights. Experiments are conducted on the NIST SRE16 and VoxCeleb1 datasets. The experimental results demonstrate the effectiveness of the bidirectional aggregation strategy and show that the proposed attentional fusion module can further improve the performance.
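
A minimal PyTorch sketch of an attentional fusion module in the spirit described above: fusion weights are predicted from the concatenated feature maps rather than using plain concatenation or addition. The weight network here is an assumption, not the paper's exact module.

```python
import torch
import torch.nn as nn

class AttentionalFusion(nn.Module):
    """Fuses two feature maps with learned, data-dependent weights instead of
    simple concatenation or element-wise addition."""
    def __init__(self, channels):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Conv1d(2 * channels, channels, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(channels, 2, kernel_size=1))    # one logit per input map

    def forward(self, a, b):                          # a, b: (batch, channels, time)
        logits = self.weight_net(torch.cat([a, b], dim=1))
        w = torch.softmax(logits, dim=1)              # (batch, 2, time), sums to 1
        return w[:, 0:1] * a + w[:, 1:2] * b          # weighted element-wise fusion

x_low, x_high = torch.randn(4, 256, 100), torch.randn(4, 256, 100)
print(AttentionalFusion(256)(x_low, x_high).shape)    # torch.Size([4, 256, 100])
```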

#8 Improving Time Delay Neural Network Based Speaker Recognition with Convolutional Block and Feature Aggregation Methods

Authors: Yu-Jia Zhang ; Yih-Wen Wang ; Chia-Ping Chen ; Chung-Li Lu ; Bo-Cheng Chan

In this paper, we develop a system that integrates multiple ideas and techniques inspired by the convolutional block and feature aggregation methods. We begin with the state-of-the-art speaker-embedding model for speaker recognition, namely the Emphasized Channel Attention, Propagation, and Aggregation in Time Delay Neural Network (ECAPA-TDNN) model, and then gradually experiment with the proposed network modules, including bottleneck residual blocks, attention mechanisms, and feature aggregation methods. In our final model, we replace the Res2Block with an SC-Block and use a hierarchical architecture for feature aggregation. We evaluate the performance of our model on the VoxCeleb1 test set and the 2020 VoxCeleb Speaker Recognition Challenge (VoxSRC20) validation set. The relative improvement of the proposed models over ECAPA-TDNN is 22.8% on VoxCeleb1 and 18.2% on VoxSRC20.

#9 Improving Deep CNN Architectures with Variable-Length Training Samples for Text-Independent Speaker Verification

Authors: Yanfeng Wu ; Junan Zhao ; Chenkai Guo ; Jing Xu

Deep Convolutional Neural Network (CNN) based speaker embeddings, such as r-vectors, have shown great success in the text-independent speaker verification (TI-SV) task. However, previous deep CNN models usually use fixed-length samples for training and variable-length utterances for speaker embedding extraction, which creates a mismatch between training and embedding. To address this issue, we investigate the effect of employing variable-length training samples on CNN-based TI-SV systems and explore two approaches to improve the performance of deep CNN architectures on TI-SV by capturing variable-term contexts. First, we present an improved selective kernel convolution that allows the networks to adaptively switch between short-term and long-term contexts based on variable-length utterances. Second, we propose a multi-scale statistics pooling method to aggregate multiple time-scale features from different layers of the networks. We build a novel ResNet34-based architecture with the two proposed approaches. Experiments are conducted on the VoxCeleb datasets. The results demonstrate that the effect of using variable-length samples varies across networks, and that the architecture with the two proposed approaches achieves a significant improvement over the r-vector baseline system.
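
The multi-scale statistics pooling idea can be sketched as follows: mean and standard deviation are pooled from feature maps tapped at different network depths and concatenated. Which layers are tapped and their dimensions are assumptions here.

```python
import torch

def stats_pool(x):
    """Mean and standard deviation over time: (batch, C, T) -> (batch, 2C)."""
    return torch.cat([x.mean(dim=-1), x.std(dim=-1)], dim=-1)

def multiscale_stats_pool(feature_maps):
    """Concatenates statistics pooled from feature maps taken at different
    network depths (and hence different time scales)."""
    return torch.cat([stats_pool(f) for f in feature_maps], dim=-1)

# toy feature maps from three stages of a ResNet-style trunk
f1, f2, f3 = torch.randn(4, 128, 200), torch.randn(4, 256, 100), torch.randn(4, 512, 50)
print(multiscale_stats_pool([f1, f2, f3]).shape)      # torch.Size([4, 1792])
```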

#10 Binary Neural Network for Speaker Verification

Authors: Tinglong Zhu ; Xiaoyi Qin ; Ming Li

Although deep neural networks are successful for many tasks in the speech domain, their high computational and memory costs make it difficult to directly deploy high-performance neural network systems on low-resource embedded devices. There are several mechanisms to reduce the size of neural networks, e.g., parameter pruning and parameter quantization. This paper focuses on how to apply binary neural networks to the task of speaker verification. The proposed binarization of training parameters can largely maintain the performance while significantly reducing storage space requirements and computational costs. Experimental results show that, after binarizing the convolutional neural network, the ResNet34-based network achieves an EER of around 5% on the VoxCeleb1 test set and even outperforms the traditional real-valued network on the text-dependent Xiaole dataset, while providing a 32× memory saving.
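
A generic 1-bit-weight sketch of the binarization idea (sign binarization with a straight-through estimator); this is not necessarily the paper's exact binarization scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a straight-through estimator: forward uses
    sign(w); backward passes gradients through where |w| <= 1."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).float()

class BinaryConv2d(nn.Conv2d):
    """Conv layer whose weights are binarized on the fly during training."""
    def forward(self, x):
        w_bin = BinarizeSTE.apply(self.weight)
        return F.conv2d(x, w_bin, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

layer = BinaryConv2d(1, 16, kernel_size=3, padding=1)
out = layer(torch.randn(2, 1, 64, 64))
out.sum().backward()                      # gradients flow via the STE
print(out.shape, layer.weight.grad is not None)
```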

#11 Mutual Information Enhanced Training for Speaker Embedding

Authors: Youzhi Tu ; Man-Wai Mak

Mutual information (MI) is useful in unsupervised and self-supervised learning. Maximizing the MI between the low-level features and the learned embeddings can preserve meaningful information in the embeddings, which can contribute to performance gains. This strategy is called deep InfoMax (DIM) in representation learning. In this paper, we follow the DIM framework so that the speaker embeddings can capture more information from the frame-level features. However, a straightforward implementation of DIM may pose a dimensionality imbalance problem because the dimensionality of the frame-level features is much larger than that of the speaker embeddings. This problem can lead to unreliable MI estimation and can even cause detrimental effects on speaker verification. To overcome this problem, we propose to squeeze the frame-level features before MI estimation through some global pooling methods. We call the proposed method squeeze-DIM. Although the squeeze operation inevitably introduces some information loss, we empirically show that the squeeze-DIM can achieve performance gains on both VoxCeleb1 and VOiCES-19 tasks. This suggests that the squeeze operation facilitates the MI estimation and maximization in a balanced dimensional space, which helps learn more informative speaker embeddings.
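
A minimal sketch of the squeeze idea: frame-level features are globally pooled ("squeezed") before estimating an InfoNCE-style lower bound on the MI with the speaker embedding. The pooling choice and the critic below are assumptions, not the paper's exact estimator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SqueezeMICritic(nn.Module):
    """InfoNCE-style lower-bound critic for the MI between a speaker embedding
    and pooled ("squeezed") frame-level features."""
    def __init__(self, frame_dim, emb_dim):
        super().__init__()
        self.proj = nn.Linear(frame_dim, emb_dim)   # map squeezed features to embedding space

    def forward(self, frame_feats, embeddings):
        # frame_feats: (batch, T, frame_dim); embeddings: (batch, emb_dim)
        squeezed = frame_feats.mean(dim=1)          # global average pooling over time
        z = F.normalize(self.proj(squeezed), dim=-1)
        e = F.normalize(embeddings, dim=-1)
        logits = e @ z.t() / 0.07                   # (batch, batch) similarity matrix
        targets = torch.arange(e.size(0))           # matched pairs lie on the diagonal
        return F.cross_entropy(logits, targets)     # minimizing this maximizes an MI bound

frames, emb = torch.randn(16, 300, 1536), torch.randn(16, 192)
loss = SqueezeMICritic(1536, 192)(frames, emb)
loss.backward()
print(float(loss))
```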

#12 Y-Vector: Multiscale Waveform Encoder for Speaker Embedding

Authors: Ge Zhu ; Fei Jiang ; Zhiyao Duan

State-of-the-art text-independent speaker verification systems typically use cepstral features or filter bank energies as speech features. Recent studies have attempted to extract speaker embeddings directly from raw waveforms and have shown competitive results. In this paper, we propose a novel multi-scale waveform encoder that uses three convolution branches with different time scales to compute speech features from the waveform. These features are then processed by squeeze-and-excitation blocks, a multi-level feature aggregator, and a time delay neural network (TDNN) to compute the speaker embedding. We show that the proposed embeddings outperform existing raw-waveform-based speaker embeddings on speaker verification by a large margin. A further analysis of the learned filters shows that the multi-scale encoder attends to different frequency bands at its different scales while resulting in a flatter overall frequency response than any of the single-scale counterparts.
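
A minimal PyTorch sketch of a multi-scale waveform encoder: three 1-D convolution branches with different kernel lengths operate directly on the raw waveform and are concatenated. The kernel sizes, stride, and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleWaveEncoder(nn.Module):
    """Three 1-D conv branches with different kernel lengths applied directly
    to the raw waveform, concatenated along the channel axis."""
    def __init__(self, out_ch=64, stride=160):             # ~10 ms hop at 16 kHz
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv1d(1, out_ch, k, stride=stride, padding=k // 2),
                          nn.BatchNorm1d(out_ch), nn.ReLU())
            for k in (51, 251, 501)                         # short / medium / long windows
        ])

    def forward(self, wav):                                 # wav: (batch, 1, samples)
        feats = [b(wav) for b in self.branches]
        t = min(f.size(-1) for f in feats)                  # align frame counts
        return torch.cat([f[..., :t] for f in feats], dim=1)

wav = torch.randn(2, 1, 32000)                              # 2 s of 16 kHz audio
print(MultiScaleWaveEncoder()(wav).shape)                   # e.g. torch.Size([2, 192, 200])
```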

#13 Phoneme-Aware and Channel-Wise Attentive Learning for Text Dependent Speaker Verification

Authors: Yan Liu ; Zheng Li ; Lin Li ; Qingyang Hong

This paper proposes a multi-task learning network with phoneme-aware and channel-wise attentive learning strategies for text-dependent Speaker Verification (SV). In the proposed structure, frame-level multi-task learning along with segment-level adversarial learning is adopted for speaker embedding extraction. Phoneme-aware attentive pooling is applied to frame-level features in the main network for the speaker classifier, guided by the corresponding posterior probabilities for the phoneme distribution from the auxiliary subnet. Further, the introduction of Squeeze-and-Excitation blocks (SE-blocks) performs dynamic channel-wise feature recalibration, which improves representational ability. The proposed method exploits speaker idiosyncrasies associated with pass-phrases, and is further improved by the phoneme-aware attentive pooling and SE-blocks from the temporal and channel-wise aspects, respectively. Experiments conducted on the RSR2015 Part 1 database confirm that the proposed system achieves outstanding results for text-dependent SV.
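
A minimal sketch of phoneme-aware attentive pooling: the attention scores condition on frame-level phoneme posteriors from an auxiliary subnet as well as the frame features themselves. The score network is an assumption for illustration.

```python
import torch
import torch.nn as nn

class PhonemeAwareAttentivePooling(nn.Module):
    """Attentive pooling whose attention scores also condition on frame-level
    phoneme posteriors from an auxiliary subnet."""
    def __init__(self, feat_dim, num_phonemes, hidden=128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim + num_phonemes, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, frames, phoneme_post):
        # frames: (B, T, feat_dim); phoneme_post: (B, T, num_phonemes)
        w = torch.softmax(self.score(torch.cat([frames, phoneme_post], dim=-1)), dim=1)
        return (w * frames).sum(dim=1)                      # (B, feat_dim)

frames = torch.randn(4, 200, 512)
post = torch.softmax(torch.randn(4, 200, 40), dim=-1)
print(PhonemeAwareAttentivePooling(512, 40)(frames, post).shape)  # torch.Size([4, 512])
```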

#14 Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding

Authors: Hongning Zhu ; Kong Aik Lee ; Haizhou Li

This paper proposes serialized multi-layer multi-head attention for neural speaker embedding in text-independent speaker verification. In prior works, frame-level features from one layer are aggregated to form an utterance-level representation. Inspired by the Transformer network, our proposed method utilizes a hierarchical architecture of stacked self-attention mechanisms to derive refined features that are more correlated with speakers. The serialized attention mechanism contains a stack of self-attention modules to create fixed-dimensional representations of speakers. Instead of utilizing multi-head attention in parallel, the proposed serialized multi-layer multi-head attention is designed to aggregate and propagate attentive statistics from one layer to the next in a serialized manner. In addition, we employ an input-aware query for each utterance with the statistics pooling. With more layers stacked, the neural network can learn more discriminative speaker embeddings. Experimental results on the VoxCeleb1 and SITW datasets show that our proposed method outperforms baseline methods, including x-vectors and x-vectors with conventional attentive pooling, by 9.7% in EER and 8.1% in DCF10^-2.
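
A minimal PyTorch sketch of the serialized aggregation idea: each stacked self-attention layer contributes attentive statistics that are accumulated into the utterance embedding layer by layer. Layer counts, dimensions, and the use of nn.TransformerEncoderLayer are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    """Attention-weighted mean and std over time: (B, T, D) -> (B, 2D)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, x):
        w = torch.softmax(self.score(x), dim=1)             # (B, T, 1)
        mean = (w * x).sum(dim=1)
        std = ((w * (x - mean.unsqueeze(1)) ** 2).sum(dim=1) + 1e-8).sqrt()
        return torch.cat([mean, std], dim=-1)

class SerializedAttentionEmbedder(nn.Module):
    """Stack of self-attention layers; each layer's attentive statistics are
    aggregated into the utterance embedding layer by layer."""
    def __init__(self, dim=256, heads=4, layers=3, emb_dim=256):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, dim_feedforward=512,
                                        batch_first=True) for _ in range(layers)])
        self.pools = nn.ModuleList([AttentiveStatsPool(dim) for _ in range(layers)])
        self.projections = nn.ModuleList([nn.Linear(2 * dim, emb_dim) for _ in range(layers)])

    def forward(self, x):                                   # x: (B, T, dim)
        emb = 0
        for block, pool, proj in zip(self.blocks, self.pools, self.projections):
            x = block(x)
            emb = emb + proj(pool(x))                       # propagate statistics serially
        return emb

print(SerializedAttentionEmbedder()(torch.randn(2, 120, 256)).shape)  # torch.Size([2, 256])
```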

#15 User-Initiated Repetition-Based Recovery in Multi-Utterance Dialogue Systems

Authors: Hoang Long Nguyen ; Vincent Renkens ; Joris Pelemans ; Srividya Pranavi Potharaju ; Anil Kumar Nalamalapu ; Murat Akbacak

Recognition errors are common in human communication. Similar errors often lead to unwanted behaviour in dialogue systems or virtual assistants. In human communication, we can recover from them by repeating misrecognized words or phrases; however, in human-machine communication this recovery mechanism is not available. In this paper, we attempt to bridge this gap and present a system that allows a user to correct speech recognition errors in a virtual assistant by repeating misunderstood words. When a user repeats part of the phrase, the system rewrites the original query to incorporate the correction. This rewrite allows the virtual assistant to understand the original query successfully. We present an end-to-end 2-step attention pointer network that can generate the rewritten query by merging the incorrectly understood utterance with the correction follow-up. We evaluate the model on data collected for this task and compare the proposed model to a rule-based baseline and a standard pointer network. We show that rewriting the original query is an effective way to handle repetition-based recovery and that the proposed model outperforms the rule-based baseline, reducing Word Error Rate by 19% relative at a 2% False Alarm Rate on annotated data.

#16 Self-Supervised Dialogue Learning for Spoken Conversational Question Answering

Authors: Nuo Chen ; Chenyu You ; Yuexian Zou

In spoken conversational question answering (SCQA), the answer to the corresponding question is generated by retrieving and then analyzing a fixed spoken document, including multi-part conversations. Most SCQA systems have considered only retrieving information from ordered utterances. However, the sequential order of dialogue is important for building a robust spoken conversational question answering system, and changes in utterance order can result in low-quality and incoherent corpora. To this end, we introduce a self-supervised learning approach, including incoherence discrimination, insertion detection, and question prediction, to explicitly capture the coreference resolution and dialogue coherence among spoken documents. Specifically, we design a joint learning framework where the auxiliary self-supervised tasks can enable the pre-trained SCQA systems towards more coherent and meaningful spoken dialogue learning. We also utilize the proposed self-supervised learning tasks to capture intra-sentence coherence. Experimental results demonstrate that our proposed method provides more coherent, meaningful, and appropriate responses, yielding superior performance gains compared to the original pre-trained language models. Our method achieves state-of-the-art results on the Spoken-CoQA dataset.

#17 Act-Aware Slot-Value Predicting in Multi-Domain Dialogue State Tracking

Authors: Ruolin Su ; Ting-Wei Wu ; Biing-Hwang Juang

As an essential component in task-oriented dialogue systems, dialogue state tracking (DST) aims to track human-machine interactions and generate state representations for managing the dialogue. Representations of dialogue states are dependent on the domain ontology and the user’s goals. In several task-oriented dialogues with a limited scope of objectives, dialogue states can be represented as a set of slot-value pairs. As the capabilities of dialogue systems expand to support increasing naturalness in communication, incorporating dialogue act processing into dialogue model design becomes essential. The lack of such consideration limits the scalability of dialogue state tracking models for dialogues having specific objectives and ontology. To address this issue, we formulate and incorporate dialogue acts, and leverage recent advances in machine reading comprehension to predict both categorical and non-categorical types of slots for multi-domain dialogue state tracking. Experimental results show that our models can improve the overall accuracy of dialogue state tracking on the MultiWOZ 2.1 dataset, and demonstrate that incorporating dialogue acts can guide dialogue state design for future task-oriented dialogue systems.

#18 Dialogue Situation Recognition for Everyday Conversation Using Multimodal Information

Authors: Yuya Chiba ; Ryuichiro Higashinaka

In recent years, dialogue systems have been applied to daily living. Such systems should be able to associate conversations with dialogue situations, such as the place where a dialogue occurs and the relationship between participants. In this study, we propose a dialogue situation recognition method that understands the perspective of dialogue scenes. The target dialogue situations include dialogue styles, places, activities, and relations between participants. For our experiments, we used the Corpus of Everyday Japanese Conversation (CEJC), which records natural everyday conversations in various situations. We experimentally verified the effectiveness of our proposed method, which uses multimodal information for situation recognition.

#19 Neural Spoken-Response Generation Using Prosodic and Linguistic Context for Conversational Systems

Authors: Yoshihiro Yamazaki ; Yuya Chiba ; Takashi Nose ; Akinori Ito

Spoken dialogue systems have become widely used in daily life. Such a system must interact with the user socially to truly operate as a partner with humans. In recent studies of dialogue systems, neural response generation has led to more natural responses. However, these studies have not considered the acoustic aspects of conversational phenomena, such as the adaptation of prosody. We propose a spoken-response generation model that extends a neural conversational model to deal with pitch control signals. Our proposed model is trained using multimodal dialogue between humans. The generated pitch control signals are input to a speech synthesis system to control the pitch of the synthesized speech. Our experiments show that the proposed system can generate synthesized speech with a more contextually appropriate F0 contour than the output of a system without pitch control, although language generation remains an issue.

#20 Semantic Transportation Prototypical Network for Few-Shot Intent Detection

Authors: Weiyuan Xu ; Peilin Zhou ; Chenyu You ; Yuexian Zou

In few-shot intent detection, only a few annotated examples are available for unseen intents, and deep models can suffer from overfitting because of the scarce data. The existing state-of-the-art few-shot model, the Prototypical Network (PN), mainly focuses on computing the similarity between examples in a metric space by leveraging sentence-level instance representations. However, sentence-level representations may incorporate highly noisy signals from unrelated words, which leads to performance degradation. In this paper, we propose the Semantic Transportation Prototypical Network (STPN) to alleviate this issue. Different from the original PN, our approach takes word-level representations as input and uses a new distance metric to obtain better sample matching results. We reformulate the few-shot classification task as an instance of optimal matching, in which the keyword semantic information between examples is expected to be matched and the matching cost is treated as similarity. Specifically, we design a Mutual-Semantic mechanism to generate word semantic information, which reduces unrelated-word noise and enriches keyword information. Then, Earth Mover’s Distance (EMD) is applied to find an optimal matching solution. Comprehensive experiments on two benchmark datasets are conducted to validate the effectiveness and generalization of our proposed model.
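
A minimal sketch of the EMD matching step: an entropic-regularized optimal transport plan (Sinkhorn iterations) is computed between two sets of word vectors, and the transport cost serves as the (dis)similarity. Uniform word weights are assumed here, whereas the paper derives weights from its Mutual-Semantic mechanism.

```python
import torch
import torch.nn.functional as F

def sinkhorn_emd(x, y, n_iters=50, eps=0.1):
    """Approximate Earth Mover's Distance between two sets of word vectors
    via entropic-regularized optimal transport (Sinkhorn iterations)."""
    xn, yn = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    cost = 1.0 - xn @ yn.t()                              # (n, m) cosine distance
    K = torch.exp(-cost / eps)                            # Gibbs kernel
    a = torch.full((x.size(0),), 1.0 / x.size(0))         # uniform word weights
    b = torch.full((y.size(0),), 1.0 / y.size(0))
    u = torch.ones_like(a)
    for _ in range(n_iters):                              # alternating scaling updates
        v = b / (K.t() @ u)
        u = a / (K @ v)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)            # transport plan
    return (plan * cost).sum()                            # matching cost ~ dissimilarity

query_words = torch.randn(7, 300)                         # toy word-level representations
support_words = torch.randn(9, 300)
print(float(sinkhorn_emd(query_words, support_words)))
```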

#21 Domain-Specific Multi-Agent Dialog Policy Learning in Multi-Domain Task-Oriented Scenarios

Authors: Li Tang ; Yuke Si ; Longbiao Wang ; Jianwu Dang

Traditional dialog policy learning methods train a generic dialog agent to address all situations. However, when the dialog agent encounters a complicated task that involves more than one domain, it becomes difficult to perform concordant actions due to the hybrid information in the multi-domain ontology. We take inspiration from a real-life bank, where several specialized departments deal with different lines of business. In this paper, we propose Domain-Specific Multi-Agent Dialog Policy Learning (DSMADPL), in which the dialog system is composed of a set of agents, each representing a specialized skill in a particular domain. Every domain-specific agent is first pre-trained with supervised learning on a dialog corpus, and the agents are then jointly improved with multi-agent reinforcement learning. When the dialog system interacts with the user, the system action in each turn is decided by the actions of the relevant agents. Experiments conducted on the commonly used MultiWOZ dataset demonstrate the effectiveness of the proposed method: the dialog success rate increases from 55.0% for the traditional method to 67.2% for our method in multi-domain scenarios.

#22 Leveraging ASR N-Best in Deep Entity Retrieval

Authors: Haoyu Wang ; John Chen ; Majid Laali ; Kevin Durda ; Jeff King ; William Campbell ; Yang Liu

Entity Retrieval (ER) in spoken dialog systems is a task that retrieves entities in a catalog for the entity mentions in user utterances. ER systems are susceptible to upstream errors, with Automatic Speech Recognition (ASR) errors being particularly troublesome. In this work, we propose a robust deep learning based ER system by leveraging ASR N-best hypotheses. Specifically, we evaluate different neural architectures to infuse ASR N-best through an attention mechanism. On 750 hours of audio data taken from live traffic, our best model achieves 11.07% relative error reduction while maintaining the same performance on rejecting out-of-domain ER requests.
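
A generic sketch of attention over N-best hypothesis encodings: each hypothesis encoding is scored against a query vector (for example, the 1-best encoding) and the weighted sum is passed downstream. The encoder and the downstream retriever are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class NBestAttentionFusion(nn.Module):
    """Fuses encodings of the ASR N-best hypotheses into one vector with an
    attention mechanism."""
    def __init__(self, hyp_dim, query_dim):
        super().__init__()
        self.attn = nn.Linear(hyp_dim + query_dim, 1)

    def forward(self, hyp_encodings, query):
        # hyp_encodings: (batch, N, hyp_dim); query: (batch, query_dim), e.g. the 1-best encoding
        q = query.unsqueeze(1).expand(-1, hyp_encodings.size(1), -1)
        scores = self.attn(torch.cat([hyp_encodings, q], dim=-1))   # (batch, N, 1)
        weights = torch.softmax(scores, dim=1)
        return (weights * hyp_encodings).sum(dim=1)                 # (batch, hyp_dim)

hyps = torch.randn(4, 5, 512)        # 5-best hypothesis encodings
query = torch.randn(4, 512)
print(NBestAttentionFusion(512, 512)(hyps, query).shape)            # torch.Size([4, 512])
```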

#23 Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams

Authors: Yuanbo Hou ; Zhesong Yu ; Xia Liang ; Xingjian Du ; Bilei Zhu ; Zejun Ma ; Dick Botteldooren

Many previous audio-visual voice-related works focus on speech, ignoring the singing voice in the growing number of musical video streams on the Internet. For processing diverse musical video data, voice activity detection is a necessary step. This paper attempts to detect the speech and singing voices of target performers in musical video streams using audio-visual information. To integrate information from the audio and visual modalities, a multi-branch network is proposed to learn audio and image representations, and the representations are fused by attention based on semantic similarity, shaping the acoustic representations through the probability that the anchor is vocalizing. Experiments show that the proposed audio-visual multi-branch network far outperforms the audio-only model in challenging acoustic environments, indicating that cross-modal information fusion based on semantic correlation is sensible and effective.
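
A minimal sketch of semantic-similarity-based cross-modal fusion: audio frames and an image embedding are projected into a shared space, and their cosine similarity gates how strongly the visual representation shapes the acoustic one. Projection sizes and the gating form are assumptions, not the paper's exact fusion module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticCrossModalFusion(nn.Module):
    """Fuses frame-level audio features with an image (anchor) embedding,
    weighting the visual contribution by audio-visual semantic similarity."""
    def __init__(self, audio_dim, image_dim, shared_dim=128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, shared_dim)
        self.image_proj = nn.Linear(image_dim, shared_dim)

    def forward(self, audio_feats, image_emb):
        # audio_feats: (B, T, audio_dim); image_emb: (B, image_dim)
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)
        v = F.normalize(self.image_proj(image_emb), dim=-1).unsqueeze(1)
        sim = (a * v).sum(dim=-1, keepdim=True)            # (B, T, 1) cosine similarity
        gate = torch.sigmoid(sim)                          # vocalization-likelihood style gate
        return a + gate * v                                # visually shaped acoustic features

audio, image = torch.randn(2, 100, 512), torch.randn(2, 256)
print(SemanticCrossModalFusion(512, 256)(audio, image).shape)   # torch.Size([2, 100, 128])
```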

#24 Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection

Author: Ui-Hyun Kim

Recent audio-visual voice activity detectors based on supervised learning require large amounts of labeled training data with manual mouth-region cropping in videos, and the performance is sensitive to a mismatch between the training and testing noise conditions. This paper introduces contrastive self-supervised learning for audio-visual voice activity detection as a possible solution to such problems. In addition, a novel self-supervised learning framework is proposed to improve overall training efficiency and testing performance on noise-corrupted datasets, as in real-world scenarios. This framework includes a branched audio encoder and a noise-tolerant loss function to cope with the uncertainty of speech and noise feature separation in a self-supervised manner. Experimental results, particularly under mismatched noise conditions, demonstrate the improved performance compared with a self-supervised learning baseline and a supervised learning framework.

#25 Noisy Student-Teacher Training for Robust Keyword Spotting

Authors: Hyun-Jin Park ; Pai Zhu ; Ignacio Lopez Moreno ; Niranjan Subrahmanya

We propose a self-training approach with a noisy student-teacher setup for streaming keyword spotting that can utilize large-scale unlabeled data and aggressive data augmentation. The proposed method applies aggressive data augmentation (spectral augmentation) to the inputs of both the student and the teacher and utilizes unlabeled data at scale, which significantly boosts the accuracy of the student under challenging conditions. Such aggressive augmentation usually degrades model performance when used with supervised training on hard-labeled data. Experiments show that aggressive spectral augmentation degrades accuracy for the baseline supervised training method, while the proposed noisy student-teacher self-training improves accuracy on some difficult-condition test sets by as much as 60%.
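
A toy sketch of one noisy student-teacher training step in the spirit described above: hard labels on labeled data plus teacher soft targets on unlabeled data, with aggressive augmentation applied to both networks' inputs. The models, the augmentation function, and the shapes are toy assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def spec_augment(x, max_mask=8):
    """Toy spectral augmentation: zero out a random band of feature bins."""
    x = x.clone()
    start = torch.randint(0, x.size(-1) - max_mask, (1,)).item()
    x[..., start:start + max_mask] = 0.0
    return x

def noisy_student_step(student, teacher, optimizer, labeled, unlabeled):
    """One noisy student-teacher step: supervised loss on labeled data plus
    distillation toward teacher soft targets on (larger-scale) unlabeled data,
    with aggressive augmentation on both inputs."""
    feats, labels = labeled
    with torch.no_grad():                      # teacher produces soft pseudo-labels
        soft = F.softmax(teacher(spec_augment(unlabeled)), dim=-1)
    sup = F.cross_entropy(student(spec_augment(feats)), labels)
    distill = F.kl_div(F.log_softmax(student(spec_augment(unlabeled)), dim=-1),
                       soft, reduction='batchmean')
    loss = sup + distill
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

# toy models: 40-dim features -> 10 keyword classes
student = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 10))
teacher = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
labeled = (torch.randn(16, 40), torch.randint(0, 10, (16,)))
unlabeled = torch.randn(128, 40)
print(noisy_student_step(student, teacher, opt, labeled, unlabeled))
```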