INTERSPEECH.2024 - Speech Processing

Total: 92

#1 Improving Audio Classification with Low-Sampled Microphone Input: An Empirical Study Using Model Self-Distillation

Authors: Dawei Liang ; Alice Zhang ; David Harwath ; Edison Thomaz

Acoustic scene and event classification is gaining traction in mobile health and wearable applications. Traditionally, relevant research focused on high-quality inputs (sampling rates >= 16 kHz). However, lower sampling rates (e.g., 1 kHz - 2 kHz) offer enhanced privacy and reduced power consumption, crucial for continuous mobile use. This study introduces efficient methods for optimizing pre-trained audio neural networks (PANNs) targeting low-quality audio, employing Born-Again self-distillation (BASD) and a cross-sampling-rate self-distillation (CSSD) strategy. Testing three PANNs with diverse mobile datasets reveals that both strategies boost model inference performance, yielding an absolute accuracy / F1 gain ranging from 1% to 6% compared to a baseline without distillation, while sampling at very low rates (1 kHz - 2 kHz). Notably, CSSD shows greater benefits, suggesting models trained on high-quality audio adapt better to lower resolutions, despite the shift in input quality.
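
A minimal sketch of what a cross-sampling-rate self-distillation objective could look like, assuming a hypothetical `teacher` PANN operating on 16 kHz audio and a `student` copy fed resampled 1 kHz - 2 kHz input; the loss weighting and temperature are illustrative placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as F
import torchaudio

def cssd_loss(student, teacher, wav_16k, labels, low_sr=2000, temperature=2.0, alpha=0.5):
    """Cross-sampling-rate self-distillation (sketch): the student sees low-rate
    audio while the frozen teacher, trained on 16 kHz input, provides soft targets."""
    wav_low = torchaudio.functional.resample(wav_16k, 16000, low_sr)
    with torch.no_grad():
        teacher_logits = teacher(wav_16k)      # soft targets from high-quality audio
    student_logits = student(wav_low)          # predictions from low-rate audio
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * hard + (1.0 - alpha) * soft
```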

#2 MFF-EINV2: Multi-scale Feature Fusion across Spectral-Spatial-Temporal Domains for Sound Event Localization and Detection

Authors: Da Mu ; Zhicheng Zhang ; Haobo Yue

Sound Event Localization and Detection (SELD) involves detecting and localizing sound events using multichannel sound recordings. The previously proposed Event-Independent Network V2 (EINV2) has achieved outstanding performance on SELD, but it still faces challenges in effectively extracting features across the spectral, spatial, and temporal domains. This paper proposes a three-stage network structure named the Multi-scale Feature Fusion (MFF) module to fully extract multi-scale features across these domains. The MFF module utilizes a parallel-subnetwork architecture to generate multi-scale spectral and spatial features, and a TF-Convolution Module to provide multi-scale temporal features. We incorporate the MFF module into EINV2 and term the proposed method MFF-EINV2. Experimental results on the 2022 and 2023 DCASE Challenge Task 3 datasets show the effectiveness of MFF-EINV2, which achieves state-of-the-art (SOTA) performance compared to published methods.

#3 Diversifying and Expanding Frequency-Adaptive Convolution Kernels for Sound Event Detection

Authors: Hyeonuk Nam ; Seong-Hu Kim ; Deokki Min ; Junhyeok Lee ; Yong-Hwa Park

Frequency dynamic convolution (FDY conv) has shown state-of-the-art performance in sound event detection (SED) using frequency-adaptive kernels obtained by a frequency-varying combination of basis kernels. However, FDY conv lacks an explicit means to diversify the frequency-adaptive kernels, potentially limiting performance. In addition, the size of the basis kernels is limited, while time-frequency patterns span a larger spectro-temporal range. We therefore propose dilated frequency dynamic convolution (DFD conv), which diversifies and expands the frequency-adaptive kernels by introducing different dilation sizes for the basis kernels. Experiments show the advantage of varying dilation sizes along the frequency dimension, and an analysis of attention weight variance confirms that the dilated basis kernels are effectively diversified. By adapting the class-wise median filter with an intersection-based F1 score, the proposed DFD-CRNN outperforms FDY-CRNN by 3.12% in terms of the polyphonic sound detection score (PSDS).
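
As a rough illustration of the idea, the sketch below implements a toy dilated frequency dynamic convolution in which each basis kernel uses a different dilation along the frequency axis and frequency-wise attention weights mix the basis outputs; the kernel size, dilations, and attention design are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DilatedFreqDynamicConv(nn.Module):
    """Toy DFD conv: basis kernels with different frequency dilations are
    combined by frequency-dependent attention weights."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 3, 4)):
        super().__init__()
        self.basis = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=(d, 1), dilation=(d, 1))
            for d in dilations
        ])
        # frequency-wise attention over basis kernels (computed from time-pooled input)
        self.attn = nn.Conv1d(in_ch, len(dilations), kernel_size=1)

    def forward(self, x):                     # x: (batch, in_ch, freq, time)
        w = torch.softmax(self.attn(x.mean(dim=-1)), dim=1)          # (B, K, F)
        outs = torch.stack([conv(x) for conv in self.basis], dim=1)  # (B, K, out_ch, F, T)
        return (outs * w.unsqueeze(2).unsqueeze(-1)).sum(dim=1)      # (B, out_ch, F, T)
```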

#4 Stream-based Active Learning for Anomalous Sound Detection in Machine Condition Monitoring

Authors: Tuan Vu Ho ; Kota Dohi ; Yohei Kawaguchi

This paper introduces an active learning (AL) framework for anomalous sound detection (ASD) in machine condition monitoring systems. Typically, ASD models are trained solely on normal samples due to the scarcity of anomalous data, leading to decreased accuracy on unseen samples during inference. AL is a promising solution to this problem: it enables the model to learn new concepts more effectively from fewer labeled examples, thus reducing manual annotation effort. However, its effectiveness in ASD remains unexplored. To minimize update cost and time, our proposed method updates only the scoring backend of the ASD system, without retraining the neural network model. Experimental results on the DCASE 2023 Challenge Task 2 dataset confirm that our AL framework significantly improves ASD performance even with low labeling budgets. Moreover, our proposed sampling strategy outperforms other baselines in terms of the partial area under the receiver operating characteristic curve.

#5 AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection

Authors: Anbai Jiang ; Bing Han ; Zhiqiang Lv ; Yufeng Deng ; Wei-Qiang Zhang ; Xie Chen ; Yanmin Qian ; Jia Liu ; Pingyi Fan

Large pre-trained models have demonstrated dominant performance in multiple areas, where consistency between pre-training and fine-tuning is the key to success. However, few works have reported satisfactory results with pre-trained models on the machine anomalous sound detection (ASD) task. This may be caused by a mismatch between the pre-trained model and the inductive bias of machine audio, i.e., inconsistency in both data and architecture. We therefore propose AnoPatch, which utilizes a ViT backbone pre-trained on AudioSet and fine-tunes it on machine audio. We believe that machine audio is more closely related to general audio datasets than to speech datasets, and that modeling it at the patch level suits the sparsity of machine audio. As a result, AnoPatch achieves state-of-the-art (SOTA) performance on the DCASE 2020 ASD dataset and the DCASE 2023 ASD dataset. We also compare multiple pre-trained models and empirically demonstrate that better consistency yields considerable improvement.

#6 FakeSound: Deepfake General Audio Detection

Authors: Zeyu Xie ; Baihan Li ; Xuenan Xu ; Zheng Liang ; Kai Yu ; Mengyue Wu

With the advancement of audio generation, generative models can produce highly realistic audio. However, the proliferation of deepfake general audio can have negative consequences. We therefore propose a new task, deepfake general audio detection, which aims to identify whether audio content has been manipulated and to locate the deepfake regions. Leveraging an automated manipulation pipeline, we construct FakeSound, a dataset for deepfake general audio detection; samples can be viewed at https://FakeSoundData.github.io. The average binary accuracy of human listeners on all test sets is consistently below 0.6, which indicates the difficulty humans face in discerning deepfake audio and affirms the efficacy of the FakeSound dataset. A deepfake detection model built on a general audio pre-trained model is proposed as a benchmark system. Experimental results demonstrate that the proposed model surpasses both the state of the art in deepfake speech detection and human testers.

#7 Sound of Traffic: A Dataset for Acoustic Traffic Identification and Counting

Authors: Shabnam Ghaffarzadegan ; Luca Bondi ; Wei-Chang Lin ; Abinaya Kumar ; Ho-Hsiang Wu ; Hans-Georg Horst ; Samarjit Das

We introduce Sound of Traffic, the largest publicly available dataset for acoustic traffic identification and counting to date. With over 415 hours of multichannel acoustic traffic data recorded at six different locations, it encompasses varying levels of traffic density and environmental conditions. In this work, we discuss strategies for the automatic collection and alignment of large amounts of labeled data, leveraging existing asynchronous urban sensors such as radar, cameras, and inductive coils. In addition to the dataset, we propose a simple baseline system for vehicle counting by vehicle type (passenger vs. commercial) and direction of travel (right-to-left and left-to-right), a fundamental task for traffic analysis. The dataset and baseline system serve as a starting point for researchers to develop more advanced algorithms and models in this field. The dataset can be accessed at https://zenodo.org/records/10700792 and https://zenodo.org/records/11209838.

#8 Multi-mic Echo Cancellation Coalesced with Beamforming for Real World Adverse Acoustic Conditions

Authors: Premanand Nayak ; Kamini Sabu ; M. Ali Basha Shaik

Robust acoustic echo cancellation (AEC) is essential for voice-enabled smart devices. Multi-channel signals are used in AEC along with a beamformer (BF) for better residual echo suppression (RES). In this work, we introduce a novel deep neural network (DNN) based unified framework for multi-microphone AEC (MMAEC) and RES under adverse signal-to-echo ratio (SER) conditions. We propose a deep-MVDR module that uses a deep steering vector and a deep power spectral density (Deep PSD) estimator to conceptually implement a minimum variance distortionless response beamformer. We also introduce additional novelty into our framework by jointly training the MMAEC and deep-MVDR modules. Both of these methods give consistent, significant improvements in ERLE, which are further enhanced by the incorporation of a playback reconstruction loss. Our system outperforms competitive baselines while remaining robust in adverse real-world conditions such as very low input SER, dominant far-end sources, and moving near-end speech sources.
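
For reference, a small NumPy sketch of the classical per-frequency MVDR solution that the deep-MVDR module conceptually implements; in the proposed framework the steering vector `d` and noise PSD matrix `Phi_n` would come from the deep estimators rather than from sample statistics.

```python
import numpy as np

def mvdr_weights(d, Phi_n, eps=1e-8):
    """Classical MVDR for one frequency bin: w = Phi_n^{-1} d / (d^H Phi_n^{-1} d).
    d: (M,) complex steering vector, Phi_n: (M, M) noise covariance / PSD matrix."""
    Phi_inv_d = np.linalg.solve(Phi_n + eps * np.eye(len(d)), d)
    return Phi_inv_d / (np.conj(d) @ Phi_inv_d)

# Beamformer output for one TF bin with microphone vector y of shape (M,):
# s_hat = np.conj(mvdr_weights(d, Phi_n)) @ y
```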

#9 Interference Aware Training Target for DNN based joint Acoustic Echo Cancellation and Noise Suppression

Authors: Vahid Khanagha ; Dimitris Koutsaidis ; Kaustubh Kalgaonkar ; Sriram Srinivasan

Despite the remarkable performance of deep learning based acoustic echo cancellation (AEC) systems, effective handling of double-talk scenarios remains a challenge. During double-talk, the speech signal from the far-end talker overlaps with the target near-end speech, degrading performance in the form of near-end speech deletions or audible echo residuals shadowing the voice of the target speaker. This paper introduces an approach to reduce the shadowing effect by altering the ground truth used for model training so that the model can effectively clean up spectral components where the interference is stronger than the target speech. The alteration leverages the availability of the interference signal during training data generation: spectral components of the ground truth where the target speech is significantly weaker than the interference are masked out. Large-scale subjective evaluation trials show that human listeners prefer the outputs generated by the new approach.
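
A simplified sketch of the described ground-truth alteration, assuming the near-end target and interference spectrograms are both available during training-data generation; the `margin_db` threshold is a hypothetical parameter.

```python
import numpy as np

def interference_aware_target(target_spec, interference_spec, margin_db=0.0):
    """Zero out ground-truth TF bins where the target speech is (significantly)
    weaker than the interference, so the model is not asked to reconstruct them."""
    keep = np.abs(target_spec) > np.abs(interference_spec) * 10.0 ** (margin_db / 20.0)
    return target_spec * keep
```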

#10 Low Complexity Echo Delay Estimator Based on Binarized Feature Matching

Authors: Yi Gao ; Xiang Su

Echo delay estimation (EDE) serves as a preprocessing component within an acoustic echo canceller (AEC). Despite some progress over the past few decades, there is a dearth of literature on efficient algorithms. This paper introduces a binarized feature-matching (BFM) framework encompassing a set of feature extraction methods, which are compared with traditional approaches such as the GCC-PHAT-based method, the adaptive-filter-based method, and the popular methods published in the WebRTC project, in terms of both complexity and performance. The computational load of the BFM methods is significantly lower, and a hybrid BFM method further enhances performance in terms of convergence speed and robustness. Owing to its low complexity, the method benefits both traditional and NN-based AEC.
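
The paper's feature set is not reproduced here, but a generic binarized-feature-matching delay search could look like the following sketch: frame-level features are reduced to one bit each (here, the sign of the log-energy change, a placeholder choice) and candidate delays are scored by bit agreement between the far-end and microphone sequences.

```python
import numpy as np

def binarize(x, frame=256):
    """One bit per frame: 1 if the frame log-energy rises, else 0."""
    n = len(x) // frame
    energy = np.log((x[: n * frame].reshape(n, frame) ** 2).sum(axis=1) + 1e-12)
    return (np.diff(energy) > 0).astype(np.uint8)

def estimate_delay(far_end, mic, max_delay_frames=200, frame=256):
    """Pick the frame delay at which the binary feature sequences agree the most."""
    b_far, b_mic = binarize(far_end, frame), binarize(mic, frame)
    scores = []
    for d in range(max_delay_frames):
        n = min(len(b_mic) - d, len(b_far))
        if n <= 0:
            break
        scores.append(np.mean(b_mic[d : d + n] == b_far[:n]))  # echo in mic lags the far end
    return int(np.argmax(scores)) * frame                      # delay in samples
```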

#11 MSA-DPCRN: A Multi-Scale Asymmetric Dual-Path Convolution Recurrent Network with Attentional Feature Fusion for Acoustic Echo Cancellation

Authors: Ye Ni ; Cong Pang ; Chengwei Huang ; Cairong Zou

Echo cancellation plays a crucial role in modern speech applications. Numerous deep-learning models have been developed for the echo cancellation task and have made great progress by incorporating additional features; however, the majority of these models overlook the characteristics of the different features and simply merge them along the channel dimension. In this paper, we propose a multi-scale asymmetric dual-path convolution recurrent network (MSA-DPCRN) consisting of two asymmetric encoding paths that extract spectral and related features from the input reference and microphone signals. Moreover, we propose a frequency-wise attentional feature fusion (AFF) method to fuse the two features while maintaining the original dynamic range. Experiments validate the effectiveness of each component of MSA-DPCRN and indicate that our model outperforms the AEC Challenge baseline in terms of the Echo-MOS metrics.

#12 Efficient Joint Beamforming and Acoustic Echo Cancellation Structure for Conference Call Scenarios

Authors: Ofer Schwartz ; Sharon Gannot

We propose an efficient scheme for combining a beamformer (BF) with acoustic echo cancellation (AEC). We focus on conference-call scenarios characterized by stationary background noise and multiple speakers who alternate frequently. Furthermore, aiming at low-resource devices, a common strategy is to apply a single AEC at the output of the BF rather than applying a separate AEC to each microphone signal. The main drawback of such a structure is the frequent change of the echo path due to BF adaptation. To circumvent this problem, we propose applying a single-channel pre-filter to the far-end signal, encompassing the BF weights and the relative acoustic responses between the reference microphone and all other microphones w.r.t. the echo loudspeaker. As a result, the AEC block becomes indifferent to changes in the BF weights. The proposed scheme is evaluated on real recordings and is found to be advantageous over standard combined AEC-BF schemes in terms of convergence speed.
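
A compact NumPy sketch of the pre-filtering idea under simplifying assumptions: if `w[m, f]` are the BF weights and `h[m, f]` the relative acoustic responses of the microphones w.r.t. the echo loudspeaker (relative to the reference microphone), the echo at the BF output is the far-end signal passed through a single combined filter, so the single AEC after the BF only has to track the reference-microphone echo path regardless of BF adaptation.

```python
import numpy as np

def prefilter_far_end(X, w, h):
    """X: (F, T) far-end STFT, w: (M, F) beamformer weights,
    h: (M, F) relative acoustic responses w.r.t. the reference microphone.
    Returns the pre-filtered far-end reference for the single-channel AEC
    running at the beamformer output."""
    g = np.sum(np.conj(w) * h, axis=0)   # (F,) combined single-channel pre-filter
    return g[:, None] * X
```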

#13 SDAEC: Signal Decoupling for Advancing Acoustic Echo Cancellation

Authors: Fei Zhao ; Jinjiang Liu ; Xueliang Zhang

In deep learning-based acoustic echo cancellation methods, neural networks implicitly learn echo paths to cancel echoes. However, under low signal-to-echo ratio conditions, the substantial energy discrepancy between the microphone signal and the reference signal impedes the network's ability to learn this mapping, resulting in poor performance. In this study, we propose a Signal Decoupling-based monaural Acoustic Echo Cancellation method called SDAEC. Specifically, we model the energies of the reference signal and the microphone signal to obtain an energy scaling factor. The reference signal is then multiplied by this scaling factor before being fed into the subsequent echo cancellation network. This reduces the difficulty of the subsequent echo cancellation step and thereby improves overall cancellation performance. Experimental results demonstrate that the proposed method enhances the performance of multiple baseline models.
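
A minimal sketch of the energy-alignment step described above, using a simple energy ratio as the scaling factor; in the paper the factor is obtained by modeling the two signals' energies, so this closed-form ratio is only an illustrative stand-in.

```python
import numpy as np

def scale_reference(ref, mic, eps=1e-8):
    """Rescale the far-end reference so its energy matches the microphone signal
    before both are fed to the echo cancellation network."""
    alpha = np.sqrt((mic ** 2).sum() / ((ref ** 2).sum() + eps))
    return alpha * ref
```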

#14 Towards Explainable Monaural Speaker Separation with Auditory-based Training

Authors: Hassan Taherian ; Vahid Ahmadi Kalkhorani ; Ashutosh Pandey ; Daniel Wong ; Buye Xu ; DeLiang Wang

Permutation ambiguity is a major challenge in training monaural talker-independent speaker separation. While permutation invariant training (PIT) is a widely used technique, it functions as a "black box", providing little insight into which auditory cues lead to successful training. We introduce a new approach to speaker separation that leverages differences in pitch and onset, both prominent cues for auditory scene analysis. We propose pitch-based and onset-based training to resolve permutation ambiguity, assigning speakers by their pitch frequencies and onset times, respectively. This approach offers a more explainable training strategy than PIT. We also propose a hybrid criterion combining these cues to improve separation performance in challenging conditions such as same-gender speakers or close utterance onsets. Evaluation results show that the pitch and onset criteria each perform competitively with PIT, and that the hybrid criterion surpasses PIT in separating two-speaker mixtures.
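
A compact sketch of how a pitch-based criterion can replace the permutation search of PIT, assuming each training mixture comes with the mean pitch of its reference speakers: references are ordered by pitch and the loss is computed against that fixed assignment. The exact loss and pitch statistic are placeholders.

```python
import torch

def si_snr(est, ref, eps=1e-8):
    ref_energy = (ref ** 2).sum(dim=-1, keepdim=True) + eps
    proj = ((est * ref).sum(dim=-1, keepdim=True) / ref_energy) * ref
    noise = est - proj
    return 10 * torch.log10((proj ** 2).sum(-1) / ((noise ** 2).sum(-1) + eps) + eps)

def pitch_based_loss(est_sources, ref_sources, ref_pitches):
    """est_sources, ref_sources: (batch, n_spk, samples); ref_pitches: (batch, n_spk).
    Output slot k is always trained on the k-th lowest-pitch reference speaker,
    so no permutation search is needed."""
    order = torch.argsort(ref_pitches, dim=1)           # low to high pitch
    idx = order.unsqueeze(-1).expand_as(ref_sources)
    ref_sorted = torch.gather(ref_sources, 1, idx)
    return -si_snr(est_sources, ref_sorted).mean()
```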

#15 Does the Lombard Effect Matter in Speech Separation? Introducing the Lombard-GRID-2mix Dataset

Authors: Iva Ewert ; Marvin Borsdorf ; Haizhou Li ; Tanja Schultz

Inspired by the human ability of selective listening, speech separation aims to equip machines with the capability to disentangle cocktail-party soundscapes into their individual sound sources. Recently, neural network based algorithms have been developed to work reliably under various conditions. However, to the best of our knowledge, changes in speaking style have not yet been studied. The Lombard effect, a reflexive change in speaking style triggered by noisy environments, is typical behavior in everyday conversational situations. In this work, we introduce a new, first-of-its-kind dataset, called Lombard-GRID-2mix, to study speech separation for two-speaker mixtures of normal speech and Lombard speech. In a comprehensive study, we show that speech separation systems can be equipped to handle both normal speech and Lombard speech. We apply a carefully designed fine-tuning method that enables the system to work even when noise is present in the Lombard speech at different SNRs.

#16 PARIS: Pseudo-AutoRegressIve Siamese Training for Online Speech Separation

Authors: Zexu Pan ; Gordon Wichern ; François G. Germain ; Kohei Saijo ; Jonathan Le Roux

While offline speech separation models have made significant advances, the streaming regime remains less explored and is typically limited to causal modifications of existing offline networks. This study focuses on empowering a streaming speech separation model with autoregressive capability, in which the current step separation is conditioned on separated samples from past steps. To do so, we introduce pseudo-autoregressive Siamese (PARIS) training: with only two forward passes through a Siamese-style network for each batch, PARIS avoids the training-inference mismatch in teacher forcing and the need for numerous autoregressive steps during training. The proposed PARIS training improves the recent online SkiM model by 1.5 dB in SI-SNR on the WSJ0-2mix dataset, with minimal change to the network architecture and inference time.
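
A heavily simplified sketch of the two-pass Siamese idea as described: the first pass runs without past context, the second pass is conditioned on the detached first-pass outputs as pseudo-autoregressive context, and both passes share weights. The `model(mixture, context=...)` interface and the loss combination are hypothetical.

```python
import torch

def paris_step(model, mixture, references, sep_loss):
    """Two forward passes through the same (Siamese) separator per batch."""
    zero_ctx = torch.zeros_like(references)
    est_first = model(mixture, context=zero_ctx)              # pass 1: no past separated samples
    est_second = model(mixture, context=est_first.detach())   # pass 2: pseudo-autoregressive context
    return sep_loss(est_first, references) + sep_loss(est_second, references)
```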

#17 OR-TSE: An Overlap-Robust Speaker Encoder for Target Speech Extraction

Authors: Yiru Zhang ; Linyu Yao ; Qun Yang

Mainstream Target Speech Extraction (TSE) systems extract target speech from a mixture using pre-enrolled reference speech, and extraction performance heavily depends on the quality of that reference. However, the speech signal of the same speaker may vary under different conditions, leading to decreased extraction performance, particularly when the reference contains overlapping speech. We therefore propose an overlap-robust speaker encoder for TSE that yields stable speaker embeddings even when the enrollment signal contains overlapping interference. Our approach combines attentive statistics pooling with contrastive learning to make the model focus on the features of the main speaker while disregarding interfering information. Building on the proposed speaker encoder, we introduce a TSE framework that derives speaker embeddings from non-overlapping regions of the mixture input. Experiments show that our speaker encoder improves TSE performance under different reference speech conditions.
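
For context, the attentive statistics pooling ingredient mentioned above can be sketched as follows: frame-level features are weighted by learned attention scores, and the weighted mean and standard deviation are concatenated into an utterance-level speaker representation. This is the standard ASP formulation, not the paper's exact encoder.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Attention-weighted mean and standard deviation of frame-level features."""
    def __init__(self, feat_dim, hidden_dim=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(hidden_dim, feat_dim, kernel_size=1),
        )

    def forward(self, x, eps=1e-8):                   # x: (batch, feat_dim, frames)
        w = torch.softmax(self.attention(x), dim=-1)  # frame-wise attention weights
        mu = (x * w).sum(dim=-1)
        var = ((x ** 2) * w).sum(dim=-1) - mu ** 2
        sigma = torch.sqrt(var.clamp(min=eps))
        return torch.cat([mu, sigma], dim=-1)         # (batch, 2 * feat_dim)
```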

#18 Multimodal Representation Loss Between Timed Text and Audio for Regularized Speech Separation

Authors: Tsun-An Hsieh ; Heeyoul Choi ; Minje Kim

Recent studies highlight the potential of textual modalities in conditioning the speech separation model's inference process. However, regularization-based methods remain underexplored despite their advantages of not requiring auxiliary text data during the test time. To address this gap, we introduce a timed text-based regularization (TTR) method that uses language model-derived semantics to improve speech separation models. Our approach involves two steps. We begin with two pretrained audio and language models, WavLM and BERT, respectively. Then, a Transformer-based audio summarizer is learned to align the audio and word embeddings and to minimize their gap. The summarizer Transformer, incorporated as a regularizer, promotes the separated sources' alignment with the semantics from the timed text. Experimental results show that the proposed TTR method consistently improves the various objective metrics of the separation results over the unregularized baselines.

#19 SA-WavLM: Speaker-Aware Self-Supervised Pre-training for Mixture Speech

Authors: Jingru Lin ; Meng Ge ; Junyi Ao ; Liqun Deng ; Haizhou Li

Pre-trained models based on self-supervised learning (SSL) have been shown to be effective in various downstream speech tasks. However, most such models are trained on single-speaker speech data, limiting their effectiveness on mixture speech. This motivates us to explore pre-training on mixture speech. This work presents SA-WavLM, a novel pre-trained model for mixture speech. Specifically, SA-WavLM follows an "extract-merge-predict" pipeline in which the representations of each speaker in the input mixture are first extracted individually and then merged before the final prediction. In this pipeline, SA-WavLM performs speaker-informed extractions that take the interactions between different speakers into account. Furthermore, a speaker shuffling strategy is proposed to enhance robustness to speaker absence. Experiments show that SA-WavLM either matches or improves upon the state-of-the-art pre-trained models.

#20 TSE-PI: Target Sound Extraction under Reverberant Environments with Pitch Information

Authors: Yiwen Wang ; Xihong Wu

Target sound extraction (TSE) separates the target sound from mixture signals based on provided clues. However, the performance of existing models degrades significantly under reverberant conditions. Inspired by auditory scene analysis (ASA), this work proposes a TSE model that incorporates pitch information, named TSE-PI. Conditional pitch extraction is achieved through a Feature-wise Linearly Modulated (FiLM) layer driven by the sound-class label. A modified Waveformer model combined with the pitch information, employing a learnable Gammatone filterbank in place of the convolutional encoder, is used for target sound extraction. Experimental results on the FSD50K dataset show a 2.4 dB improvement in target sound extraction under reverberant environments when incorporating the pitch information and the Gammatone filterbank.
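
As a reference point, the Feature-wise Linear Modulation (FiLM) conditioning mentioned above can be sketched as below: a sound-class embedding produces per-channel scale and shift parameters that modulate intermediate features of the pitch extractor. The dimensions and the placement of the layer are assumptions.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: a conditioning vector (e.g., a sound-class
    embedding) yields per-channel scale (gamma) and shift (beta) parameters."""
    def __init__(self, cond_dim, n_channels):
        super().__init__()
        self.to_gamma = nn.Linear(cond_dim, n_channels)
        self.to_beta = nn.Linear(cond_dim, n_channels)

    def forward(self, features, cond):             # features: (B, C, T), cond: (B, cond_dim)
        gamma = self.to_gamma(cond).unsqueeze(-1)  # (B, C, 1)
        beta = self.to_beta(cond).unsqueeze(-1)
        return gamma * features + beta
```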

#21 Enhanced Reverberation as Supervision for Unsupervised Speech Separation

Authors: Kohei Saijo ; Gordon Wichern ; François G. Germain ; Zexu Pan ; Jonathan Le Roux

Reverberation as supervision (RAS) is a framework that allows for training monaural speech separation models from multi-channel mixtures in an unsupervised manner. In RAS, models are trained so that sources predicted from a mixture at an input channel can be mapped to reconstruct a mixture at a target channel. However, stable unsupervised training has so far only been achieved in over-determined source-channel conditions, leaving the key determined case unsolved. This work proposes enhanced RAS (ERAS) for solving this problem. Through qualitative analysis, we found that stable training can be achieved by leveraging the loss term to alleviate the frequency-permutation problem. Separation performance is also boosted by adding a novel loss term where separated signals mapped back to their own input mixture are used as pseudo-targets for the signals separated from other channels and mapped to the same channel. Experimental results demonstrate high stability and performance of ERAS.

#22 Deep Echo Path Modeling for Acoustic Echo Cancellation

Authors: Fei Zhao ; Chenggang Zhang ; Shulin He ; Jinjiang Liu ; Xueliang Zhang

Acoustic echo cancellation (AEC) is a key audio processing technology that removes echoes from microphone inputs to enable natural-sounding full-duplex communication. In recent years, deep learning has shown great potential for advancing AEC. However, deep learning methods face challenges in generalizing to complex environments, especially unseen conditions not represented in training. In this paper, we propose a deep learning-based method to predict the echo path in the time-frequency domain. Specifically, we first estimate the echo path in a single-talk scenario without a near-end signal and then utilize these predicted echo paths as auxiliary labels to train the model in a double-talk scenario with a near-end signal. Experimental results show that our method outperforms strong baselines and exhibits good generalization to unseen acoustic scenarios. By estimating the echo path with deep learning, this work advances AEC performance under complex conditions.
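
A simplified sketch of the two-stage recipe, under the assumption that the echo path is represented as a per-bin filter in the time-frequency domain and that the model exposes a `model(mic_spec, far_spec)` interface; both assumptions are illustrative, not the paper's exact formulation.

```python
import torch

def stage1_loss(model, far_spec, echo_spec, eps=1e-8):
    """Single-talk (no near-end speech): a per-bin echo-path label is available
    directly from the echo and far-end spectra, and the model learns to predict it."""
    h_label = echo_spec / (far_spec + eps)
    h_pred = model(echo_spec, far_spec)
    return (h_pred - h_label).abs().mean()

def stage2_loss(model, mic_spec, far_spec, near_spec, h_aux, near_loss, lam=0.1):
    """Double-talk: echo paths predicted in stage 1 (h_aux) serve as auxiliary
    labels alongside the usual near-end reconstruction loss."""
    h_pred = model(mic_spec, far_spec)
    near_est = mic_spec - h_pred * far_spec     # subtract the modeled echo
    return near_loss(near_est, near_spec) + lam * (h_pred - h_aux).abs().mean()
```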

#23 Graph Attention Based Multi-Channel U-Net for Speech Dereverberation With Ad-Hoc Microphone Arrays

Authors: Hongmei Guo ; Yijiang Chen ; Xiao-Lei Zhang ; Xuelong Li

Speech dereverberation with ad-hoc microphone arrays has not been studied sufficiently, particularly in scenarios where the reverberation time is large. In this paper, we propose a novel multi-channel U-Net model for speech dereverberation with ad-hoc microphone arrays, in which an attention module is integrated into the model and trained end-to-end to perform channel selection and fusion. Specifically, we first train a single-channel U-Net model. Then, we replicate the U-Net model across the channels. Finally, we train the attention module to aggregate the information from the channels, with the parameters of the U-Net model fixed at this stage. To our knowledge, this is the first work to use a U-Net for dereverberation with ad-hoc microphone arrays. We study two attention mechanisms, self-attention and graph attention, and integrate the attention module into either the bottleneck layer or the output layer of the multi-channel U-Net, resulting in four implementations. Experimental results demonstrate that the proposed method achieves state-of-the-art performance and that the attention module is crucial for channel selection and fusion, improving performance under long reverberation times.

#24 Speech dereverberation constrained on room impulse response characteristics

Authors: Louis Bahrman ; Mathieu Fontaine ; Jonathan Le Roux ; Gaël Richard

Single-channel speech dereverberation aims at extracting a dry speech signal from a recording affected by the acoustic reflections in a room. However, most current deep learning-based approaches to speech dereverberation are not interpretable in terms of room acoustics and can be considered black-box systems in that regard. In this work, we address this problem by regularizing the training loss with a novel physical coherence loss, which encourages the room impulse response (RIR) induced by the dereverberated output of the model to match the acoustic properties of the room in which the signal was recorded. Our investigation demonstrates that the quality of the dereverberated signal is preserved while the induced RIR becomes more physically coherent.

#25 DeWinder: Single-Channel Wind Noise Reduction using Ultrasound Sensing

Authors: Kuang Yuan ; Shuo Han ; Swarun Kumar ; Bhiksha Raj

The quality of audio recordings in outdoor environments is often degraded by the presence of wind. Mitigating the impact of wind noise on the perceptual quality of single-channel speech remains a significant challenge due to its non-stationary characteristics. Prior work in noise suppression treats wind noise as general background noise without explicitly modeling its characteristics. In this paper, we leverage ultrasound as an auxiliary modality to explicitly sense the airflow and characterize the wind noise. We propose DeWinder, a multi-modal deep-learning framework that fuses ultrasonic Doppler features and speech signals for wind noise reduction. Our results show that DeWinder significantly improves the noise reduction capabilities of state-of-the-art speech enhancement models.