Speaker diarization systems segment a conversation recording based on the speakers' identity. Such systems can misclassify the speaker of a portion of audio due to a variety of factors, such as speech pattern variation, background noise, and overlapping speech. These errors propagate to, and can adversely affect, downstream systems that rely on the speaker's identity, such as speaker-adapted speech recognition. One of the ways to mitigate these errors is to provide segment-level diarization confidence scores to downstream systems. In this work, we investigate multiple methods for generating diarization confidence scores, including those derived from the original diarization system and those derived from an external model. Our experiments across multiple datasets and diarization systems demonstrate that the most competitive confidence score methods can isolate 30% of the diarization errors within segments with the lowest 10% of confidence scores.
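Below is a minimal sketch (toy numbers and a hypothetical per-segment error list, not the paper's evaluation code) of how such a claim can be checked: rank segments by confidence and measure the share of diarization error that falls into the lowest-confidence 10% of segments.

```python
# Minimal sketch (not the paper's code): given hypothetical per-segment
# confidence scores and per-segment diarization error durations, measure
# what fraction of total error falls in the lowest-confidence 10% of segments.
import numpy as np

def error_recall_at_low_confidence(confidences, segment_errors, fraction=0.10):
    """Fraction of total diarization error captured by the `fraction`
    of segments with the lowest confidence scores."""
    confidences = np.asarray(confidences, dtype=float)
    segment_errors = np.asarray(segment_errors, dtype=float)
    order = np.argsort(confidences)                 # lowest confidence first
    k = max(1, int(round(fraction * len(order))))   # size of the low-confidence bucket
    captured = segment_errors[order[:k]].sum()
    total = segment_errors.sum()
    return captured / total if total > 0 else 0.0

# Toy usage with made-up numbers (illustrative only).
conf = [0.95, 0.40, 0.88, 0.30, 0.75, 0.92, 0.55, 0.99, 0.20, 0.81]
errs = [0.0,  1.2,  0.1,  2.0,  0.3,  0.0,  0.5,  0.0,  1.5,  0.2]
print(error_recall_at_low_confidence(conf, errs))   # share of error in bottom 10%
```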
End-to-end neural diarization (EEND) models offer significant improvements over traditional embedding-based Speaker Diarization (SD) approaches but fall short in generalizing to long-form audio with a large number of speakers. The EEND-vector-clustering method mitigates this by combining local EEND with global clustering of speaker embeddings from local windows, but it requires an additional speaker embedding framework alongside the EEND module. In this paper, we propose a novel framework applying EEND both locally and globally for long-form audio without separate speaker embeddings. This approach achieves significant relative DER reductions of 13% and 10% over the conventional 1-pass EEND on the Callhome American English and RT03-CTS datasets respectively, and marginal improvements over EEND-vector-clustering without the need for additional speaker embeddings. Furthermore, we discuss the computational complexity of our proposed framework and explore strategies for reducing processing times.
While standard speaker diarization attempts to answer the question "who spoke when", many realistic applications are interested in determining "who spoke what". In both the conventional modularized approach and the more recent end-to-end neural diarization (EEND), an additional automatic speech recognition (ASR) model and an orchestration algorithm are required to associate speakers with recognized words. In this paper, we propose Word-level End-to-End Neural Diarization (WEEND) with an auxiliary network, a multi-task learning algorithm that performs end-to-end ASR and speaker diarization in the same architecture by sharing blank logits. Such a framework makes it easy to add diarization capabilities to any existing RNN-T based ASR model without Word Error Rate (WER) regressions. Experimental results demonstrate that WEEND outperforms a strong turn-based diarization baseline system on all 2-speaker short-form scenarios, with the capability to generalize to audio lengths of 5 minutes.
In this paper, we make the explicit connection between image segmentation methods and end-to-end diarization methods. From these insights, we propose a novel, fully end-to-end diarization model, EEND-M2F, based on the Mask2Former architecture. Speaker representations are computed in parallel using a stack of transformer decoders, in which irrelevant frames are explicitly masked from the cross attention using predictions from previous layers. EEND-M2F is efficient and truly end-to-end, eliminating the need for additional segmentation models or clustering algorithms. Our model achieves state-of-the-art performance on several public datasets, such as AMI, AliMeeting and RAMC. Most notably, our DER of 16.07% on DIHARD-III is the first major improvement upon the challenge-winning system.
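A minimal sketch (assumed shapes and threshold, not the EEND-M2F implementation) of one masked cross-attention step in the spirit described above, where frames a previous layer predicted as inactive for a speaker query are hidden from that query's attention:

```python
# Minimal sketch (assumptions, not the authors' code): one masked cross-attention
# step in which frames deemed irrelevant for a speaker query by the previous
# layer are excluded from the attention computation.
import torch

def masked_cross_attention(queries, frame_feats, prev_frame_probs, thr=0.5):
    """
    queries:          (S, D)  speaker query embeddings (one per candidate speaker)
    frame_feats:      (T, D)  frame-level encoder features
    prev_frame_probs: (S, T)  previous layer's speaker-activity predictions in [0, 1]
    Returns updated queries of shape (S, D).
    """
    d = queries.shape[-1]
    scores = queries @ frame_feats.T / d ** 0.5           # (S, T) attention logits
    irrelevant = prev_frame_probs < thr                   # frames to hide per speaker
    # Safety: if a speaker would mask every frame, leave all frames visible.
    all_masked = irrelevant.all(dim=1, keepdim=True)
    irrelevant = irrelevant & ~all_masked
    scores = scores.masked_fill(irrelevant, float("-inf"))
    attn = torch.softmax(scores, dim=-1)                  # (S, T)
    return attn @ frame_feats                             # (S, D) refined queries

# Toy shapes: 3 speaker queries, 50 frames, 64-dim features.
q, x, p = torch.randn(3, 64), torch.randn(50, 64), torch.rand(3, 50)
print(masked_cross_attention(q, x, p).shape)              # torch.Size([3, 64])
```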
Speaker diarization in real-world videos presents significant challenges due to varying acoustic conditions, diverse scenes, the presence of off-screen speakers, etc. This paper builds upon a previous study (AVR-Net) and introduces a novel multi-modal speaker diarization system, AFL-Net. The proposed AFL-Net incorporates dynamic lip movement as an additional modality to enhance identity distinction. In addition, unlike AVR-Net, which extracts high-level representations from each modality independently, AFL-Net employs a two-step cross-attention mechanism to fully fuse the different modalities, providing more comprehensive information and enhancing performance. Moreover, we incorporate a masking strategy during training, where the face and lip modalities are randomly obscured. This strategy enhances the impact of the audio modality on the system outputs. Experimental results demonstrate that AFL-Net outperforms state-of-the-art baselines, such as AVR-Net and DyViSE.
Advancements in diarization have prompted the development of supervised learning models. These models extract fixed-length embeddings from audio files of varying lengths. Despite challenges, commercial API models like Speechbrain, Resemblyzer, Whisper AI, and Pyannote have addressed this issue. However, these models typically utilize Mel-Frequency Cepstral Coefficient (MFCC) features, convolution layers, and dimension reduction techniques to create embeddings. Our proposed method introduces the Wavelet Scattering Transform (WST), which prioritizes information content and allows users to customize the shape of the embeddings according to their model requirements. Coupling WST with AutoEncoders (WST-AE) in a residual manner enhances semantic latent space representations, which can be clustered segment-wise in an unsupervised manner. Testing on the AMI and VoxConverse datasets has shown a reduction in Diarization Error Rate (DER) with fewer training parameters and without the need for separate embedding models.
Audio denoising, especially in the context of bird sounds, remains a challenging task due to persistent residual noise. Traditional and deep learning methods often struggle with artificial or low-frequency noise. In this work, we propose ViTVS, a novel approach that leverages the power of the vision transformer (ViT) architecture. ViTVS adeptly combines segmentation techniques to disentangle clean audio from complex signal mixtures. Our key contributions encompass the development of ViTVS, introducing comprehensive, long-range, and multi-scale representations. These contributions directly tackle the limitations inherent in conventional approaches. Extensive experiments demonstrate that ViTVS outperforms state-of-the-art methods, positioning it as a benchmark solution for real-world bird sound denoising applications. Source code is available at: https://github.com/aiai-4/ViVTS.
In ornithology, it is widely acknowledged that bird species display diverse dialects in their calls across different regions. Consequently, computational methods that identify bird species solely through their calls face significant challenges. There is growing interest in understanding the impact of species-specific dialects on the effectiveness of bird species recognition methods. Despite potential mitigation through the expansion of dialect datasets, the absence of publicly available testing data currently impedes robust benchmarking efforts. This paper presents the Dialect Dominated Dataset of Bird Vocalisation (D3BV), the first cross-corpus dataset that focuses on dialects in bird vocalisations. The D3BV comprises more than 25 hours of audio recordings from 10 bird species distributed across three distinct regions in the contiguous United States (CONUS). In addition to presenting the dataset, we conduct analyses and establish baseline models for cross-corpus bird recognition. The data and code are publicly available online: https://zenodo.org/records/11544734
With the advent of pre-trained self-supervised learning (SSL) models, speech processing research is showing increasing interest in disentanglement and explainability. Among other methods, probing speech classifiers has emerged as a promising approach to gain new insights into SSL models' out-of-domain performance. We explore the knowledge transfer capabilities of pre-trained speech models with vocalizations from the closest living relatives of humans: non-human primates. We focus on classifying the identity of northern grey gibbons (Hylobates funereus) from their calls with probing and layer-wise analysis of state-of-the-art SSL speech models compared to pre-trained bird species classifiers and audio taggers. By testing the reliance of these models on background noise and time-wise information, as well as performance variations across layers, we propose a new understanding of the mechanisms underlying speech models' efficacy as bioacoustic tools.
Phonocardiogram classification methods using deep neural networks have recently been widely applied to the early detection of cardiovascular diseases. Despite their excellent recognition rates, their sizeable computational complexity limits further development. Knowledge distillation (KD) is now an established paradigm for model compression. While current research on multi-teacher KD has shown potential to impart more comprehensive knowledge to the student than single-teacher KD, this approach is not suitable for all scenarios. This paper proposes a novel KD strategy to realise an adaptive multi-teacher instruction mechanism. We design a teacher selection strategy, called the voting network, to determine the contribution of each teacher at every distillation point, so that the student can select useful information and discard redundant information. An evaluation demonstrates that our method reaches excellent accuracy (92.8%) while maintaining a low computational complexity (0.7M).
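A minimal sketch (hypothetical layer sizes and loss formulation, not the paper's architecture) of an adaptive multi-teacher setup: a small voting network predicts per-teacher weights and the student is distilled toward the weighted teacher mixture.

```python
# Minimal sketch (hypothetical, not the paper's voting network): a small network
# produces per-teacher weights for one distillation point, and the student is
# distilled toward the weighted combination of teacher logits.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VotingNetwork(nn.Module):
    def __init__(self, feat_dim, num_teachers):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, num_teachers)
        )

    def forward(self, features):
        # features: (B, feat_dim) pooled input features -> (B, num_teachers) weights
        return torch.softmax(self.score(features), dim=-1)

def adaptive_kd_loss(student_logits, teacher_logits, weights, tau=4.0):
    """student_logits: (B, C); teacher_logits: (B, K, C); weights: (B, K)."""
    mixed = (weights.unsqueeze(-1) * teacher_logits).sum(dim=1)   # (B, C)
    return F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(mixed / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2

# Toy usage: batch of 8, 3 teachers, 2 classes, 32-dim pooled features.
voter = VotingNetwork(32, 3)
w = voter(torch.randn(8, 32))
loss = adaptive_kd_loss(torch.randn(8, 2), torch.randn(8, 3, 2), w)
print(loss.item())
```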
Obstructive Sleep Apnea-Hypopnea Syndrome (OSAHS) is a prevalent chronic breathing disorder caused by upper airway obstruction. Previous studies advanced OSAHS evaluation through machine learning-based systems trained on sleep snoring or speech signal datasets. However, constructing datasets for training a precise and rapid OSAHS evaluation system poses a challenge, since 1) it is time-consuming to collect sleep snores and 2) the speech signal is limited in reflecting upper airway obstruction. In this paper, we propose a new snoring dataset for OSAHS evaluation, named SimuSOE, in which a novel and time-effective snoring collection method is introduced for tackling the above problems. In particular, we adopt simulated snoring which is a type of snore intentionally emitted by patients to replace natural snoring. Experimental results indicate that the simulated snoring signal during wakefulness can serve as an effective feature in OSAHS preliminary screening.
This paper proposes an ensembling model as a spoofed speech countermeasure, with a particular focus on synthetic voice. Despite recent advances in speaker verification based on deep neural networks, this technology is still susceptible to various malicious attacks, so countermeasures are needed. While an increasing number of anti-spoofing techniques can be found in the literature, the combination of multiple models, or ensemble models, still proves to be one of the best approaches. However, current iterations often rely on fixed weight assignments, potentially neglecting the unique strengths of each individual model. In response, we propose a novel ensembling model, an adaptive neural network-based approach that dynamically adjusts weights based on the input utterance. Our experimental findings show that this approach outperforms traditional weighted score averaging techniques, showcasing its ability to adapt effectively to diverse audio characteristics.
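A minimal sketch (assumed feature and score shapes, not the authors' model) of utterance-dependent score fusion: a gating network maps utterance features to softmax weights over the subsystem scores, replacing fixed weight assignments.

```python
# Minimal sketch (assumed names/shapes, not the authors' model): a gating
# network predicts utterance-dependent weights that combine the scores of
# several countermeasure subsystems, instead of using fixed weights.
import torch
import torch.nn as nn

class AdaptiveScoreFusion(nn.Module):
    def __init__(self, feat_dim, num_systems):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, num_systems)
        )

    def forward(self, utt_embedding, system_scores):
        # utt_embedding: (B, feat_dim) pooled utterance features
        # system_scores: (B, num_systems) per-subsystem spoofing scores
        weights = torch.softmax(self.gate(utt_embedding), dim=-1)
        return (weights * system_scores).sum(dim=-1)      # (B,) fused scores

# Toy usage: 2 utterances, 64-dim features, 4 subsystems.
fusion = AdaptiveScoreFusion(feat_dim=64, num_systems=4)
fused = fusion(torch.randn(2, 64), torch.randn(2, 4))
print(fused.shape)  # torch.Size([2])
```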
This paper defines Spoof Diarization as a novel task in the Partial Spoof (PS) scenario. It aims to determine what spoofed when, which includes not only locating spoof regions but also clustering them according to different spoofing methods. As a pioneering study in spoof diarization, we focus on defining the task, establishing evaluation metrics, and proposing a benchmark model, namely the Countermeasure-Condition Clustering (3C) model. Utilizing this model, we first explore how to effectively train countermeasures to support spoof diarization using three labeling schemes. We then utilize spoof localization predictions to enhance the diarization performance. This first study reveals the high complexity of the task, even in restricted scenarios where only a single speaker per audio file and an oracle number of spoofing methods are considered. Our code is available at https://github.com/nii-yamagishilab/PartialSpoof.
In this work, a dual-branch network is proposed to exploit both local and global information of utterances for spoofing speech detection (SSD). The local artifacts of spoofing speech can reside in specific temporal or spectral regions, which are the primary objectives for SSD systems. We propose a spectro-temporal graph attention network to jointly capture the temporal and spectral differences of spoofing speech. Unlike existing methods, the proposed method exploits a cross-attention mechanism to bridge the spectro-temporal dependency. As global artifacts can also provide complementary information for SSD, we use a BiLSTM-based branch to model long-term temporal discriminative cues. These two branches are separately optimized with a weighted cross-entropy loss, and their scores are fused at equal weights. Results on three benchmark datasets (i.e., ASVspoof 2019, 2021 LA and 2021 DF) reveal the superiority of the proposed method over advanced systems.
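A minimal sketch (illustrative class weights, not the paper's values) of the training and fusion recipe described above: each branch is optimized with a weighted cross-entropy loss, and the two branch scores are fused with equal weights at inference time.

```python
# Minimal sketch (assumptions, not the paper's code): per-branch weighted
# cross-entropy training and equal-weight score fusion of the two branches.
import torch
import torch.nn as nn

# Class-weighted cross-entropy (spoofing datasets are usually imbalanced;
# the weights below are illustrative, not the paper's values).
ce = nn.CrossEntropyLoss(weight=torch.tensor([0.1, 0.9]))

def train_step(local_logits, global_logits, labels):
    # Each branch is optimized separately with the weighted CE loss.
    return ce(local_logits, labels), ce(global_logits, labels)

def fuse_scores(local_logits, global_logits):
    # Equal-weight fusion of the two branches' spoof-class probabilities.
    local_score = torch.softmax(local_logits, dim=-1)[:, 1]
    global_score = torch.softmax(global_logits, dim=-1)[:, 1]
    return 0.5 * local_score + 0.5 * global_score

# Toy usage with random logits for a batch of 4 utterances.
labels = torch.tensor([0, 1, 1, 0])
l_logits, g_logits = torch.randn(4, 2), torch.randn(4, 2)
print(train_step(l_logits, g_logits, labels))
print(fuse_scores(l_logits, g_logits))
```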
The rapid development of speech synthesis algorithms poses a challenge in constructing corresponding training datasets for speech anti-spoofing systems in real-world scenarios. The copy-synthesis method offers a simple yet effective solution to this problem. However, the limitation of this method is that it only utilizes the artifacts generated by vocoders, neglecting those from acoustic models. This paper aims to locate the artifacts introduced by the acoustic models of Text-to-Speech (TTS) and Voice Conversion (VC) algorithms, and optimize the copy-synthesis pipeline. The proposed rhythm and speaker perturbation modules successfully boost anti-spoofing models to leverage the artifacts introduced by acoustic models, thereby enhancing their generalization ability when facing various TTS and VC algorithms.
Automatic Speaker Verification (ASV) is extensively used in many security-sensitive domains, but the increasing prevalence of adversarial attacks has seriously compromised the trustworthiness of these systems. Targeted black-box attacks emerge as the most formidable threat, proving incredibly challenging to counteract. However, existing defenses exhibit limitations when applied in real-world scenarios. We propose VoiceDefense - a novel adversarial sample detection method that slices an audio sample into multiple segments and captures their local audio features with segment-specific ASV scores. These scores present distributions that vary distinctly between genuine and adversarial samples, which VoiceDefense leverages for detection. VoiceDefense outperforms the state of the art with a best AUC of 0.9624 and is consistently effective against various attacks and perturbation budgets, all while maintaining remarkably low computational overhead.
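A minimal sketch (with a dummy scorer standing in for a real ASV system, not the VoiceDefense implementation) of the segment-scoring idea: slice the utterance, score each segment, and summarize the resulting score distribution for a downstream detector.

```python
# Minimal sketch (hypothetical scorer, not the VoiceDefense code): slice an
# utterance into overlapping segments, score each segment with an ASV system,
# and summarize the score distribution; its statistics differ between genuine
# and adversarial samples and can feed a detector.
import numpy as np

def segment_score_stats(waveform, asv_score_fn, sr=16000, win_s=1.0, hop_s=0.5):
    """asv_score_fn(segment) -> scalar ASV score; assumed to exist externally."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    scores = [
        asv_score_fn(waveform[start:start + win])
        for start in range(0, max(1, len(waveform) - win + 1), hop)
    ]
    scores = np.asarray(scores, dtype=float)
    # Simple distribution features; a threshold or classifier would consume these.
    return {
        "mean": float(scores.mean()),
        "std": float(scores.std()),
        "range": float(scores.max() - scores.min()),
    }

# Toy usage with a dummy scorer standing in for a real ASV system.
dummy_scorer = lambda seg: float(np.mean(seg ** 2))
audio = np.random.randn(4 * 16000)
print(segment_score_stats(audio, dummy_scorer))
```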
Automatic Speaker Verification (ASV), increasingly used in security-critical applications, faces vulnerabilities from rising adversarial attacks, with few effective defenses available. In this paper, we propose a neural codec-based adversarial sample detection method for ASV. The approach leverages the codec's ability to discard redundant perturbations and retain essential information. Specifically, we distinguish between genuine and adversarial samples by comparing ASV score differences between the original audio and its re-synthesis by codec models. This comprehensive study explores all open-source neural codecs and their variant models in our experiments. The Descript-audio-codec model stands out by delivering the highest detection rate among 15 neural codecs and surpassing seven prior state-of-the-art (SOTA) detection methods. Notably, our single-model method even outperforms a SOTA ensemble method by a large margin.
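A minimal sketch (hypothetical function handles; no real codec or ASV model is called) of the detection rule described above: compare ASV scores before and after codec re-synthesis and flag samples with a large gap.

```python
# Minimal sketch (hypothetical function names, not the paper's code): compare
# the ASV score of the original audio with that of its codec re-synthesis;
# adversarial perturbations tend to be discarded by the codec, so a large
# score gap flags a likely adversarial sample.
import numpy as np

def codec_detection_score(audio, enrollment, asv_score_fn, codec_resynth_fn):
    """
    asv_score_fn(test_audio, enrollment) -> scalar similarity (assumed).
    codec_resynth_fn(audio) -> re-synthesized waveform (assumed; e.g. a neural
    codec such as Descript-audio-codec wrapped by the caller).
    """
    original_score = asv_score_fn(audio, enrollment)
    resynth_score = asv_score_fn(codec_resynth_fn(audio), enrollment)
    return abs(original_score - resynth_score)   # large gap => likely adversarial

def is_adversarial(audio, enrollment, asv_score_fn, codec_resynth_fn, threshold):
    return codec_detection_score(audio, enrollment, asv_score_fn,
                                 codec_resynth_fn) > threshold

# Toy usage with stand-in functions (no real ASV or codec is invoked here).
fake_asv = lambda a, e: float(np.dot(a[:100], e[:100]))
fake_codec = lambda a: a + 0.01 * np.random.randn(*a.shape)
wav, enroll = np.random.randn(16000), np.random.randn(16000)
print(is_adversarial(wav, enroll, fake_asv, fake_codec, threshold=0.5))
```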
Adversarial attacks introduce subtle perturbations into audio signals to cause automatic speaker verification (ASV) systems to make mistakes. To address this challenge, adversarial purification techniques have emerged, among which diffusion models have proven effective. However, the latest diffusion-model-based approaches have the drawback that the quality of the generated audio is not high enough. Moreover, these approaches tend to focus solely on audio features, often neglecting textual information. To overcome these limitations, we propose a textual-driven adversarial purification (TDAP) framework, which integrates diffusion models with pretrained large audio language models for comprehensive defense. TDAP employs textual data extracted from the audio to guide the diffusion-based purification process. Extensive experimental results show that TDAP significantly enhances defense robustness against adversarial attacks.
In black-box attacks on speaker recognition systems, adversarial examples exhibit better transferability to an unseen victim system if they can consistently spoof an ensemble of substitute models. In this work, we propose a gradient-aligned ensemble attack method to find the optimal gradient direction for updating the adversarial example using a set of substitute models. Specifically, we first calculate an overfitting-reduced gradient for each substitute model by randomly masking some regions of the input acoustic features. We then obtain a weight for each substitute model's gradient based on the consistency of that gradient with respect to the others. The final update gradient is calculated as the weighted sum of the gradients over all substitute models. Experimental results on the VoxCeleb dataset verify the effectiveness of the proposed approach for the speaker identification and speaker verification tasks.
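A minimal sketch (assumptions throughout, including the cosine-similarity consistency weighting and the sign-based step, rather than the authors' exact formulation) of one gradient-aligned ensemble update step:

```python
# Minimal sketch (not the authors' implementation): each substitute model's
# gradient is computed on a randomly masked copy of the features, each gradient
# is weighted by its cosine similarity to the average of the others, and the
# weighted sum gives the final update direction.
import torch

def gradient_aligned_step(features, label, substitute_models, loss_fn,
                          mask_prob=0.1, step_size=1e-3):
    grads = []
    for model in substitute_models:
        x = features.clone().detach()
        # Randomly mask (zero out) some regions to reduce per-model overfitting.
        mask = (torch.rand_like(x) > mask_prob).float()
        x = (x * mask).requires_grad_(True)
        loss = loss_fn(model(x), label)
        grads.append(torch.autograd.grad(loss, x)[0])

    grads = torch.stack(grads)                                  # (M, *feat_shape)
    flat = grads.flatten(1)
    mean_others = (flat.sum(0, keepdim=True) - flat) / (len(grads) - 1)
    # Consistency weight: cosine similarity with the other gradients' average.
    weights = torch.cosine_similarity(flat, mean_others, dim=1).clamp(min=0)
    weights = weights / (weights.sum() + 1e-8)
    update = (weights.view(-1, *([1] * (grads.dim() - 1))) * grads).sum(0)
    return features + step_size * update.sign()                 # one ascent step

# Toy usage with two tiny stand-in substitute "models".
models = [torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(80 * 50, 10))
          for _ in range(2)]
feats, y = torch.randn(1, 80, 50), torch.tensor([3])
adv = gradient_aligned_step(feats, y, models, torch.nn.functional.cross_entropy)
print(adv.shape)
```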
Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to their convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts of synthetic speech can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we propose a Temporal-Channel Modeling (TCM) module to enhance MHSA's capability for capturing temporal-channel dependencies. Experimental results on ASVspoof 2021 show that, with only 0.03M additional parameters, the TCM module can outperform the state-of-the-art system by 9.25% in EER. A further ablation study reveals that utilizing both temporal and channel information yields the most improvement for detecting synthetic speech.
Recent advances in foundation models have enabled audio-generative models that produce high-fidelity sounds associated with music, events, and human actions. Despite the success achieved in modern audio-generative models, the conventional approach to assessing the quality of the audio generation relies heavily on distance metrics like Frechet Audio Distance. In contrast, we aim to evaluate the quality of audio generation by examining the effectiveness of using them as training data. Specifically, we conduct studies to explore the use of synthetic audio for audio recognition. Moreover, we investigate whether synthetic audio can serve as a resource for data augmentation in speech-related modeling. Our comprehensive experiments demonstrate the potential of using synthetic audio for audio recognition and speech-related modeling. Our code is available at https://github.com/usc-sail/SynthAudio.
Despite progress in audio classification, a generalization gap remains between speech and other sound domains, such as environmental sounds and music. Models trained for speech tasks often fail to perform well on environmental or musical audio tasks, and vice versa. While self-supervised (SSL) audio representations offer an alternative, there has been limited exploration of scaling both model and dataset sizes for SSL-based general audio classification. We introduce Dasheng, a simple SSL audio encoder, based on the efficient masked autoencoder framework. Trained with 1.2 billion parameters on 272,356 hours of diverse audio, Dasheng obtains significant performance gains on the HEAR benchmark. It outperforms previous works on CREMA-D, LibriCount, Speech Commands, VoxLingua, and competes well in music and environment classification. Dasheng features inherently contain rich speech, music, and environmental information, as shown in nearest-neighbor classification experiments.
Despite its widespread adoption as the prominent neural architecture, the Transformer has spurred several independent lines of work to address its limitations. One such approach is selective state space models, which have demonstrated promising results for language modelling. However, their feasibility for learning self-supervised, general-purpose audio representations is yet to be investigated. This work proposes Audio Mamba, a selective state space model for learning general-purpose audio representations from randomly masked spectrogram patches through self-supervision. Empirical results on ten diverse audio recognition downstream tasks show that the proposed models, pretrained on the AudioSet dataset, consistently outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by a considerable margin and demonstrate better performance in dataset size, sequence length and model size comparisons.
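A minimal sketch (illustrative patch and mask sizes, not the Audio Mamba code) of the masked spectrogram-patch input preparation that such self-supervised pretraining relies on:

```python
# Minimal sketch (illustrative only): random masking of spectrogram patches,
# the input preparation step shared by masked-prediction self-supervised audio
# models such as the one described above.
import torch

def patchify(spec, patch_f=16, patch_t=16):
    # spec: (F, T) log-mel spectrogram with F, T divisible by the patch sizes.
    n_f, n_t = spec.shape
    patches = spec.reshape(n_f // patch_f, patch_f, n_t // patch_t, patch_t)
    return patches.permute(0, 2, 1, 3).reshape(-1, patch_f * patch_t)  # (N, P)

def random_mask(patches, mask_ratio=0.75):
    n = patches.shape[0]
    num_mask = int(mask_ratio * n)
    perm = torch.randperm(n)
    masked_idx, visible_idx = perm[:num_mask], perm[num_mask:]
    # The encoder sees only visible patches; masked ones are reconstruction targets.
    return patches[visible_idx], visible_idx, masked_idx

spec = torch.randn(128, 512)          # e.g. 128 mel bins x 512 frames
patches = patchify(spec)
visible, vis_idx, mask_idx = random_mask(patches)
print(patches.shape, visible.shape)   # torch.Size([256, 256]) torch.Size([64, 256])
```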
Sound event detection (SED) methods that leverage a large pre-trained Transformer encoder network have shown promising performance in recent DCASE challenges. However, they still rely on an RNN-based context network to model temporal dependencies, largely due to the scarcity of labeled data. In this work, we propose a pure Transformer-based SED model with masked-reconstruction based pre-training, termed MAT-SED. Specifically, a Transformer with relative positional encoding is first designed as the context network, pre-trained with the masked-reconstruction task on all available target data in a self-supervised way. Both the encoder and the context network are then jointly fine-tuned in a semi-supervised manner. Furthermore, a global-local feature fusion strategy is proposed to enhance the localization capability. MAT-SED surpasses state-of-the-art performance on DCASE2023 Task 4, achieving 0.587/0.896 PSDS1/PSDS2 respectively.
Sound event detection is the task of recognizing sounds and determining their extent (onset/offset times) within an audio clip. Existing systems commonly predict sound presence posteriors in short time frames. Then, thresholding produces binary frame-level presence decisions, with the extent of individual events determined by merging presence in consecutive frames. In this paper, we show that frame-level thresholding deteriorates event extent prediction by coupling it with the system's sound presence confidence. We propose to decouple the prediction of event extent and confidence by introducing sound event bounding boxes (SEBBs), which format each sound event prediction as a combination of a class type, extent, and overall confidence. We also propose a change-detection-based algorithm to convert frame-level posteriors into SEBBs. We find the algorithm significantly improves the performance of DCASE 2023 Challenge systems, boosting the state of the art from 0.644 to 0.686 PSDS1.
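A minimal sketch (a simplified thresholding stand-in for the paper's change-detection algorithm) of converting frame-level posteriors into SEBBs whose overall confidence is decoupled from the extent decision:

```python
# Minimal sketch (simplified, not the paper's change-detection algorithm): turn
# frame-level posteriors for one class into sound event bounding boxes, each
# with an extent (onset/offset) decided independently of the box's overall
# confidence (here the mean posterior inside the extent).
import numpy as np

def posteriors_to_sebbs(posteriors, hop_s=0.02, extent_thr=0.2, class_name="event"):
    active = posteriors >= extent_thr
    # Find rising/falling edges of the active regions.
    edges = np.diff(active.astype(int), prepend=0, append=0)
    onsets, offsets = np.where(edges == 1)[0], np.where(edges == -1)[0]
    sebbs = []
    for on, off in zip(onsets, offsets):
        sebbs.append({
            "class": class_name,
            "onset": on * hop_s,
            "offset": off * hop_s,
            "confidence": float(posteriors[on:off].mean()),
        })
    return sebbs

# Toy usage: two events of different strength in a synthetic posterior track.
post = np.concatenate([np.zeros(50), np.full(30, 0.9),
                       np.zeros(40), np.full(20, 0.35), np.zeros(60)])
for box in posteriors_to_sebbs(post):
    print(box)
```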