Audio generation systems now create very realistic soundscapes that can enhance media production, but also pose potential risks. Several studies have examined deepfakes in speech or singing voice. However, environmental sounds have different characteristics, which may make methods for detecting speech and singing deepfakes less effective for real-world sounds. In addition, existing datasets for environmental sound deepfake detection are limited in scale and audio types. To address this gap, we introduce EnvSDD, the first large-scale curated dataset designed for this task, consisting of 45.25 hours of real and 316.74 hours of fake audio. The test set includes diverse conditions to evaluate generalizability, such as unseen generation models and unseen datasets. We also propose an audio deepfake detection system based on a pre-trained audio foundation model. Results on EnvSDD show that our proposed system outperforms state-of-the-art systems from the speech and singing domains.
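As a rough illustration of such a detector, the sketch below pools embeddings from a frozen encoder and adds a binary real/fake head. The stand-in encoder, the mean pooling, and all dimensions are assumptions for illustration, not the paper's actual foundation model or classification head.

```python
# Minimal sketch: frozen audio encoder + linear real/fake head.
import torch
import torch.nn as nn

class StandInEncoder(nn.Module):
    """Placeholder for a pre-trained audio foundation model (kept frozen)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=400, stride=160),  # crude frame-level features
            nn.ReLU(),
        )

    def forward(self, wav):                  # wav: (batch, samples)
        feats = self.net(wav.unsqueeze(1))   # (batch, dim, frames)
        return feats.transpose(1, 2)         # (batch, frames, dim)

class DeepfakeDetector(nn.Module):
    def __init__(self, encoder, dim=256):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():  # foundation model stays frozen
            p.requires_grad = False
        self.head = nn.Linear(dim, 2)        # real vs. fake logits

    def forward(self, wav):
        with torch.no_grad():
            feats = self.encoder(wav)
        pooled = feats.mean(dim=1)           # temporal mean pooling
        return self.head(pooled)

detector = DeepfakeDetector(StandInEncoder())
logits = detector(torch.randn(4, 16000))     # four one-second 16 kHz clips
print(logits.shape)                          # torch.Size([4, 2])
```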
Despite significant advances in ASR, the specific acoustic cues models rely on remain unclear. Prior studies have examined such cues on a limited set of phonemes and outdated models. In this work, we apply a feature attribution technique to identify the relevant acoustic cues for a modern Conformer-based ASR system. By analyzing plosives, fricatives, and vowels, we assess how feature attributions align with their acoustic properties in the time and frequency domains, which are also essential for human speech perception. Our findings show that the ASR model relies on vowels’ full time spans, particularly their first two formants, with greater saliency in male speech. It also better captures the spectral characteristics of sibilant fricatives than non-sibilants and prioritizes the release phase in plosives, especially burst characteristics. These insights enhance the interpretability of ASR models and highlight areas for future research to uncover potential gaps in model robustness.
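The abstract does not name the specific attribution method, so the sketch below illustrates the general idea with gradient-times-input saliency on a log-mel spectrogram fed to a toy classifier that stands in for the Conformer ASR model. Summing the saliency map along each axis gives time- and frequency-domain profiles of the kind the analysis refers to.

```python
# Hedged sketch of feature attribution on a spectrogram input (gradient x input).
import torch
import torch.nn as nn

model = nn.Sequential(          # toy stand-in for an acoustic model
    nn.Flatten(),
    nn.Linear(80 * 100, 40),    # 80 mel bins x 100 frames -> 40 phone classes
)

spec = torch.randn(1, 80, 100, requires_grad=True)  # log-mel spectrogram
logits = model(spec)
target = logits[0, 7]           # score of one (hypothetical) phone class
target.backward()

saliency = (spec.grad * spec).abs().squeeze(0)      # gradient x input attribution
time_profile = saliency.sum(dim=0)    # which frames matter (time domain)
freq_profile = saliency.sum(dim=1)    # which mel bins matter (frequency domain)
print(time_profile.shape, freq_profile.shape)       # torch.Size([100]) torch.Size([80])
```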
Most modern approaches for audio processing are opaque, in the sense that they do not provide an explanation for their decisions. For this reason, various methods have been proposed to explain the outputs generated by these models. Good explanations can result in interesting insights about the data or the model, as well as increase trust in the system. Unfortunately, evaluating the quality of explanations is far from trivial since, for most tasks, there is no clear ground truth explanation to use as reference. In this work, we propose a benchmark for time-localized explanations for audio classification models that uses time annotations of target events as a proxy for ground truth explanations. We use this benchmark to systematically optimize and compare various approaches for model-agnostic post-hoc explanation, obtaining, in some cases, close to perfect explanations. Finally, we illustrate the utility of the explanations for uncovering spurious correlations.
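The benchmark idea, treating annotated event time spans as proxy ground truth, can be illustrated in a few lines. The frame rate, the synthetic saliency curve, and the choice of ROC-AUC as the agreement score below are illustrative assumptions, not the benchmark's actual protocol.

```python
# Sketch: score a frame-level explanation against annotated event time spans.
import numpy as np
from sklearn.metrics import roc_auc_score

n_frames, hop = 500, 0.02                       # 10 s of audio at 20 ms frames
events = [(1.2, 2.0), (6.5, 7.3)]               # annotated target-event spans (s)

mask = np.zeros(n_frames)                       # proxy ground-truth explanation
for start, end in events:
    mask[int(start / hop):int(end / hop)] = 1

saliency = np.random.rand(n_frames)             # explanation to be evaluated
saliency[mask == 1] += 0.5                      # pretend it highlights the events

print("explanation AUC:", roc_auc_score(mask, saliency))
```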
Audio DNNs have demonstrated impressive performance on various machine listening tasks; however, most of their representations are computationally costly and uninterpretable, leaving room for optimization. Here, we propose a novel approach centered on spectrotemporal modulation (STM) features, which are derived from a signal processing method that mimics the neurophysiological representation in the human auditory cortex. The classification performance of our STM-based model, without any pretraining, is comparable to that of pretrained audio DNNs across diverse naturalistic speech, music, and environmental sounds, which are essential categories for both human cognition and machine perception. These results show that STM is an efficient and interpretable feature representation for audio classification, advancing the development of machine listening and unlocking exciting new possibilities for the basic understanding of speech and auditory sciences, as well as for developing audio brain-computer interfaces (BCIs) and cognitive computing.
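One common way to compute spectrotemporal modulation features is the 2-D Fourier transform of a log-mel spectrogram, whose axes correspond to temporal modulation rate and spectral modulation scale. The sketch below illustrates only this general idea; the paper's exact STM front end and parameters are not given in the abstract, and the amplitude-modulated tone stands in for real audio.

```python
# Hedged sketch of a modulation power spectrum (one realization of STM features).
import numpy as np
import librosa

sr = 16000
t = np.arange(2 * sr) / sr
y = np.sin(2 * np.pi * 220 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))  # 4 Hz AM tone

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64, hop_length=256)
log_mel = librosa.power_to_db(mel)                       # (64 mel bands, frames)

mps = np.abs(np.fft.fftshift(np.fft.fft2(log_mel)))      # modulation power spectrum
# Axis 0 ~ spectral modulations, axis 1 ~ temporal modulations (the 4 Hz AM shows up here).
print("modulation power spectrum shape:", mps.shape)
```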
In this study, we gained insights that contribute to achieving accent-robust ASR using only native speech data. In human perception of non-native speech, a phenomenon known as the "interlanguage speech intelligibility benefit" (ISIB) is observed, in which non-native listeners who share the speaker's native language understand the speech better than even native listeners do. Based on the idea that discrete tokens extracted from self-supervised learning (SSL) models reflect human perception of speech, we conducted an analytical study of the robustness of discrete token-based ASR to non-native speech, varying the language used to train the tokenization, which we view as a technical implementation of ISIB. The results showed that ISIB indeed occurred in the discrete token-based ASR. Since our approach relies only on native speech data to simulate the behavior of human perception, it is expected to be applicable to a wide range of accents for which speech data are scarce.
Techniques for discrete audio representation, which convert an audio signal into a sequence of audio tokens using neural audio codecs or self-supervised speech models, have gained attention because they make it possible to model audio efficiently with large language models (LMs). While these audio tokens have been studied in various domains (e.g., speech, music, and general sound), their encoding properties across domains remain unclear. This paper examines several audio token types to analyze cross-domain variations. Our major findings include that audio tokens exhibit consistent statistical structure and probabilistic predictability, as deduced from rank-frequency distributions and perplexity, regardless of the domain. However, the token usage pattern is somewhat domain-dependent. This result helps explain the steady success of versatile audio LMs, while also suggesting that domain-aware LMs could further improve performance by better capturing domain-specific token usage distributions.
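Both diagnostics named above are easy to reproduce on any token stream. The sketch below computes a rank-frequency (Zipf-style) slope and a unigram perplexity on synthetic codec-like tokens, purely to illustrate the procedure; the token values and codebook size are made up.

```python
# Sketch: rank-frequency slope and unigram perplexity of an audio token stream.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
tokens = rng.zipf(1.3, size=50_000) % 1024       # fake codec tokens, 1024-entry codebook

counts = np.array(sorted(Counter(tokens).values(), reverse=True))
freqs = counts / counts.sum()

# Rank-frequency: roughly linear in log-log space indicates a Zipf-like structure.
log_rank = np.log(np.arange(1, len(freqs) + 1))
log_freq = np.log(freqs)
slope = np.polyfit(log_rank, log_freq, 1)[0]
print("Zipf-like slope:", round(slope, 2))

# Unigram perplexity of the token stream under its own empirical distribution.
entropy = -(freqs * np.log(freqs)).sum()
print("unigram perplexity:", round(float(np.exp(entropy)), 1))
```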
Self-supervised learning has been widely used in developing speech foundation models. Most languages, however, are represented only in multilingual foundation models. We introduce monolingual self-supervised foundation models pre-trained on more than 150,000 hours of Finnish speech and propose a new interpretation technique to understand their capabilities. To our knowledge, this is the largest monolingual dataset used for self-supervised non-English speech representation learning. Our models demonstrate superior downstream low-resource ASR performance and improved generalization compared to prior work, with absolute WER reductions of up to 14%. Moreover, our proposed interpretation technique, Layer Utilization Rate (LUR), enables us to assess the percentage of neurons in each layer that contribute strongly to the output. Empirical results show that the proposed LUR metric can indicate the potential of a fine-tuned model's size and architecture to generalize to unseen domains.
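The abstract defines LUR only informally (the share of neurons per layer that contribute strongly to the output), so the snippet below implements one plausible reading as a hypothetical sketch: the fraction of units whose mean absolute activation exceeds half the layer maximum. The synthetic activations merely illustrate how the rate can differ across layers.

```python
# Hypothetical sketch of a Layer Utilization Rate computation.
import numpy as np

def layer_utilization_rate(activations, rel_threshold=0.5):
    """activations: (frames, units). Fraction of units whose mean |activation|
    exceeds rel_threshold * the strongest unit in the layer."""
    strength = np.abs(activations).mean(axis=0)          # per-unit contribution proxy
    return float((strength >= rel_threshold * strength.max()).mean())

rng = np.random.default_rng(1)
for layer in range(3):                                   # pretend 3 transformer layers
    sparsity = rng.random(768) ** layer                  # later layers: fewer strong units
    acts = rng.normal(size=(200, 768)) * sparsity        # synthetic hidden states
    print(f"layer {layer}: LUR = {layer_utilization_rate(acts):.2f}")
```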
Music auto-tagging is essential for organizing and discovering music in extensive digital libraries. While foundation models achieve exceptional performance in this domain, their outputs often lack interpretability, limiting trust and usability for researchers and end-users alike. In this work, we present an interpretable framework for music auto-tagging that leverages groups of musically meaningful multimodal features, derived from signal processing, deep learning, ontology engineering, and natural language processing. To enhance interpretability, we cluster features semantically and employ an expectation maximization algorithm, assigning distinct weights to each group based on its contribution to the tagging process. Our method achieves competitive tagging performance while offering a deeper understanding of the decision-making process, paving the way for more transparent and user-centric music tagging systems.
The emergence of large language models has demonstrated that systems trained solely on text can acquire extensive world knowledge, develop reasoning capabilities, and internalize abstract semantic concepts - showcasing properties that can be associated with general intelligence. This raises an intriguing question: Do such concepts emerge in models trained on other modalities, such as speech? Furthermore, when models are trained jointly on multiple modalities: Do they develop a richer, more structured semantic understanding? To explore this, we analyze the conceptual structures learned by speech and textual models both individually and jointly. We employ Latent Concept Analysis, an unsupervised method for uncovering and interpreting latent representations in neural networks, to examine how semantic abstractions form across modalities. To support reproducibility, we have released our code along with a curated audio version of the SST-2 dataset for public access.
Modern neural speech models benefit from having longer context, and many approaches have been proposed to increase the maximum context a model can use. However, few have attempted to measure how much context these models actually use, i.e., the effective context. Here, we propose two approaches to measuring the effective context, and use them to analyze different speech Transformers. For supervised models, we find that the effective context correlates well with the nature of the task, with fundamental frequency tracking, phone classification, and word classification requiring increasing amounts of effective context. For self-supervised models, we find that the effective context increases mainly in the early layers and remains relatively short, similar to that of the supervised phone model. Given that these models do not use a long context during prediction, we show that HuBERT can be run in streaming mode without modification to the architecture and without further fine-tuning.
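The abstract does not spell out the two proposed measures, so the following is only a hedged illustration of the underlying intuition: feed a frame-level model progressively less left context and watch when its prediction for the final frame stops changing. The toy convolutional model, feature dimensions, and context sizes are assumptions.

```python
# Illustrative probe of effective context via left-context truncation.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Conv1d(40, 64, kernel_size=9, padding=4), nn.ReLU(),
                      nn.Conv1d(64, 40, kernel_size=9, padding=4))   # toy frame model
feats = torch.randn(1, 40, 400)                                      # (batch, dims, frames)

with torch.no_grad():
    full = model(feats)[0, :, -1]                                     # last-frame output
    for ctx in (5, 10, 20, 50, 100, 400):                             # frames of left context
        out = model(feats[:, :, -ctx:])[0, :, -1]
        diff = (out - full).norm() / full.norm()
        # Once ctx exceeds the model's receptive field (17 frames here), the change is zero.
        print(f"context {ctx:3d} frames: relative change {diff:.3f}")
```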
In this paper we study word stress representations learned by self-supervised speech models (S3Ms), specifically the Wav2vec 2.0 model. We investigate the S3M representations of word stress for five different languages: three languages with variable or lexical stress (Dutch, English, and German) and two languages with fixed or demarcative stress (Hungarian and Polish). We train diagnostic stress classifiers on S3M embeddings and show that they can distinguish between stressed and unstressed syllables in read-aloud short sentences with high accuracy. We also test language-specificity effects of the S3M word stress representations. The results indicate that the word stress representations are language-specific, with a greater difference between the variable-stress and the fixed-stress languages.
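A diagnostic (probing) classifier of the kind described above can be as simple as a linear model on frozen embeddings. The sketch below uses synthetic stand-ins for syllable-level wav2vec 2.0 vectors and stress labels, so the number it prints only illustrates the recipe, not any real result.

```python
# Sketch of a diagnostic stress probe on frozen S3M embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
emb = rng.normal(size=(2000, 768))                 # one embedding vector per syllable
labels = rng.integers(0, 2, size=2000)             # 1 = stressed, 0 = unstressed
emb[labels == 1, :8] += 1.0                        # pretend stress is (weakly) encoded

X_tr, X_te, y_tr, y_te = train_test_split(emb, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", round(probe.score(X_te, y_te), 3))
```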
How language-specific are speech representations learned by self-supervised models? Existing work has shown that a range of linguistic features can be successfully decoded from end-to-end models trained only on speech recordings. However, it is less clear to what extent pre-training on specific languages improves the encoding of language-specific linguistic information. Here we test the encoding of Dutch phonetic and lexical information in the internal representations of self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the representation of Dutch linguistic features as compared to pre-training on similar amounts of English or larger amounts of multilingual data. This language-specific advantage is well detected by trained clustering or classification probes, and is partially observable using zero-shot metrics. Furthermore, the language-specific benefit in linguistic feature encoding aligns with downstream performance on automatic speech recognition.
Self-supervised models for speech representation learning now see widespread use for their versatility and performance on downstream tasks, but the effect of model architecture on the linguistic information learned in their representations remains under-studied. This study investigates two such models, HuBERT and wav2vec 2.0, and minimally compares two of their architectural differences: training objective and iterative pseudo-label refinement through multiple training iterations. We find that differences in canonical correlation of hidden representations to word identity, phoneme identity, and speaker identity are explained by training iteration, not training objective. We suggest that future work investigate the reason for the effectiveness of iterative refinement in encoding linguistic information in self-supervised speech representations.
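The comparison above rests on canonical correlation between hidden representations and word, phoneme, or speaker identity. As a hedged sketch on synthetic data, the snippet below uses scikit-learn's plain CCA to correlate frame-level activations with one-hot phoneme labels; the study's exact CCA variant is not specified in the abstract, so this is only the generic recipe.

```python
# Sketch: canonical correlation between hidden states and phoneme identity.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
hidden = rng.normal(size=(3000, 256))              # layer activations, one row per frame
phones = rng.integers(0, 40, size=3000)            # frame-level phoneme IDs
onehot = np.eye(40)[phones]
hidden[:, :40] += 0.5 * onehot                     # inject some phoneme information

cca = CCA(n_components=10).fit(hidden, onehot)
H, P = cca.transform(hidden, onehot)
corrs = [np.corrcoef(H[:, i], P[:, i])[0, 1] for i in range(10)]
print("mean canonical correlation:", round(float(np.mean(corrs)), 3))
```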
As the capabilities of large-scale pre-trained models evolve, understanding the determinants of their outputs becomes more important. Feature attribution aims to reveal which parts of the input elements contribute the most to model outputs. In speech processing, the unique characteristics of the input signal make the application of feature attribution methods challenging. We study how factors such as the input type and the aggregation and perturbation timespan impact the reliability of standard feature attribution methods, and how these factors interact with the characteristics of each classification task. We find that standard approaches to feature attribution are generally unreliable when applied to the speech domain, with the exception of word-aligned perturbation methods applied to word-based classification tasks.
Early diagnosis and intervention are crucial for mild cognitive impairment (MCI), as MCI often progresses to more severe neurodegenerative conditions. In this study, we explore utilizing deep learning for MCI detection without losing the interpretability provided by feature-based approaches. We used a dataset consisting of 90 MCI patients and 91 controls collected via a remote assessment platform and analyzed the participants’ spontaneous speech responses to the Patient Report of Problems (PROP), which asks patients to report their most bothersome general health problems. The proposed deep neural network, which features a bottleneck layer encoding 13 interpretable symptom domains, achieved an AUC of 0.62, thereby outperforming a set of feature-based classifiers while ensuring interpretability through the bottleneck layer. We further illustrated the model’s interpretability by examining how the predicted PROP domains influence the final predictions using Shapley values.
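The described architecture, a network whose penultimate layer is a 13-unit bottleneck corresponding to interpretable PROP symptom domains, can be sketched as follows. The input dimensionality and hidden sizes are illustrative assumptions; only the 13-unit bottleneck comes from the abstract.

```python
# Sketch of a classifier with an interpretable 13-domain bottleneck.
import torch
import torch.nn as nn

class BottleneckMCIClassifier(nn.Module):
    def __init__(self, in_dim=512, n_domains=13):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, n_domains))   # interpretable bottleneck
        self.classifier = nn.Linear(n_domains, 1)                 # MCI vs. control logit

    def forward(self, x):
        domains = self.encoder(x)          # inspect these to explain the prediction
        return self.classifier(domains), domains

model = BottleneckMCIClassifier()
logit, domains = model(torch.randn(8, 512))
print(logit.shape, domains.shape)          # torch.Size([8, 1]) torch.Size([8, 13])
```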
Gender-affirming voice training (GAVT) can reduce voice-gender incongruence for transgender and gender-diverse individuals, but access is often limited by high costs and a lack of qualified providers. Interactive software could expand access, but existing GAVT apps are typically limited in functionality. This paper describes the development and testing of software to address one of the most challenging aspects of GAVT: modifying vocal tract resonance. We introduce a biofeedback system that uses a real-time linear predictive coding (LPC) spectrum to visualize changes in resonance as learners adjust their vocal tract shape. Visual targets for brighter (feminine) and darker (masculine) resonances help guide users toward their desired voice characteristics. In-lab user testing with 10 trans women yielded an average System Usability Scale (SUS) score of 75.25, supporting the acceptability of the tool as an adjunct resonance training tool in the GAVT context.
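The core of the visual feedback, an LPC spectral envelope whose peaks shift with vocal tract resonance, can be computed per frame roughly as follows. The frame length and LPC order are typical textbook choices rather than the system's actual settings, and the random frame stands in for live microphone input.

```python
# Sketch: LPC spectral envelope of a short speech frame.
import numpy as np
import librosa
from scipy.signal import freqz

sr = 16000
frame = np.random.randn(int(0.03 * sr))            # 30 ms frame stands in for live audio
frame *= np.hamming(len(frame))

order = 2 + sr // 1000                             # rule-of-thumb LPC order (18 at 16 kHz)
a = librosa.lpc(frame, order=order)                # all-pole vocal tract model
w, h = freqz([1.0], a, worN=512, fs=sr)            # LPC spectral envelope over frequency

print("strongest envelope peak near", round(float(w[np.argmax(np.abs(h))]), 1), "Hz")
```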
A voice's gender is considered to be dictated by one's biology and cultural situation. Without modification, this determinism results in collinearity between acoustic metrics, making it difficult to disentangle a metric's contribution to gender perception. To study disentanglement on natural speech, we collaborate with a gender-affirming voice teacher to collect the Disentangled Source-Filter Dataset (DSFD): 45 minutes of audio across 25 Pitch, Resonance, and Weight voice configurations, coupled with Electroglottograph (EGG) measurements. Our analysis demonstrates that certain acoustic and physical metrics, namely average $F_0$, $\Delta F$, Contact Quotient (CQ), and Loudness, correlate with Pitch, Resonance, and Weight. In subsequent perceptual studies of gender, naturalness, and realness, we find that $\Delta F$ is the strongest predictor of perceived gender. The perceived naturalness and realness of a voice, however, prove to be unpredictable from these acoustic metrics.
Trans people tend to be well aware of the ways in which (perceived) gender is indexically linked to the voice. Focusing on the understudied population of trans men, we present one of the first studies on style shift in trans speakers, considering the phonetic features of trans men's speech in different contexts, and their own beliefs about vocal cues to gender perception. Our participants (n = 7) provided read speech samples with contrasting imagined listeners, and discussed their voices and experiences in semi-structured interviews. Acoustic and qualitative analyses show that our participants adopted lower, narrower pitch ranges when they imagined speaking to a stranger, though individuals differed in the ways they altered their vowel formants and /s/ production. The interview data provide further insights into trans men's speech styles, self-monitoring, and pursuit of authenticity, highlighting shared concerns about voice training apps, safety, and representation for trans people.
Developing equitable and inclusive speech technologies requires datasets that represent the full spectrum of human voices, including those of LGBTQIA+ speakers. However, capturing spontaneous, high-quality audio from marginalized gender and sexual identities presents significant ethical, logistical, and representational challenges. This paper introduces Queer Waves, a German speech corpus compiled from podcast and YouTube content featuring self-identified LGBTQIA+ speakers, with a particular focus on diverse gender identities and sexual orientations. We further address the legal and ethical considerations inherent in collecting sensitive personal data. The Queer Waves corpus comprises approximately 335 hours of speech from over 400 self-identified LGBTQIA+ speakers, spanning ages from 18 to 86 years. By expanding representation across a wide range of gender identities and orientations, Queer Waves aims to advance the development of fairer and more accurate speech technologies.
This paper presents a large-scale dataset capturing Reddit comments with pronoun declarations in the respective user flairs, offering a new resource for studying linguistic identity, gender expression, and digital discourse. Totaling 72 million tokens, it contains all comments by pronoun-declaring users to present a broader view of their language use than previous corpora that selected isolated utterances. The dataset enables research across multiple domains, including (online) sociolinguistics, natural language processing (NLP), and other social sciences. It facilitates the study of pronoun-sharing behavior, the distribution and adoption of non-binary pronouns, and the use of mixed pronouns in online discourse. Future work can expand the dataset to capture rarer pronoun declarations; nevertheless, it provides a highly curated, valuable foundation for the study of online gender expression and discourse, innovative language, and identity performance in digital spaces.
Speech-generating devices (SGDs) provide users with text-to-speech (TTS) voices that shape identity and self-expression. Current TTS voices enable self-expression but often lack customizable features for authentic voice embodiment, particularly for nonbinary SGD users seeking gender affirmation as existing TTS voices largely reproduce binary, cisgender speech patterns. This study examines how nonbinary SGD users embody, or disembody, synthetic voices and the factors influencing voice affirmation. Through a survey, we analyze experiences of nonbinary SGD users and their impressions of generated speech samples, investigating the role of technological possibilities in gender affirmation and voice embodiment. Findings inform the creation of more user-centered TTS technologies, and challenge dominant paradigms in speech technology, gesturing toward a posthumanist rethinking of voice as co-constructed between human and machine.
The URGENT 2024 Challenge aims to foster speech enhancement (SE) techniques with great universality, robustness, and generalizability, featuring a broader task definition, large-scale multi-domain data, and comprehensive evaluation metrics. Building on the challenge outcomes, this paper presents an in-depth analysis of two key, yet understudied, issues in SE system development: data cleaning and evaluation metrics. We highlight several overlooked problems in traditional SE pipelines: (1) mismatches between declared and effective audio bandwidths, along with label noise even in various "high-quality" speech corpora; (2) the lack of both effective SE systems that can handle the hardest conditions (e.g., speech overlap, strong noise/reverberation) and reliable measures of speech sample difficulty; (3) the importance of combining multifaceted metrics for a comprehensive evaluation that correlates well with human judgment. We hope that this endeavor can inspire improved SE pipeline designs in the future.
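The bandwidth-mismatch issue in (1) can be screened for with a simple long-term spectrum check. The sketch below compares the bandwidth implied by the sampling rate with an effective bandwidth estimated from spectral rolloff; the -40 dB threshold and the synthetic signal are illustrative assumptions, not the challenge's actual criterion.

```python
# Sketch: declared vs. effective bandwidth of an audio signal.
import numpy as np

def effective_bandwidth(y, sr, threshold_db=-40.0):
    spec = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    level_db = 20 * np.log10(spec / (spec.max() + 1e-12) + 1e-12)
    active = freqs[level_db > threshold_db]
    return float(active.max()) if active.size else 0.0

sr = 48000                                    # declared: 48 kHz file -> 24 kHz bandwidth
t = np.arange(sr * 2) / sr
y = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 3500 * t)  # content stops at 3.5 kHz

print("declared bandwidth:", sr / 2, "Hz")
print("effective bandwidth:", round(effective_bandwidth(y, sr), 1), "Hz")
```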
There has been a growing effort to develop universal speech enhancement (SE) to handle inputs with various speech distortions and recording conditions. The URGENT Challenge series aims to foster such universal SE by embracing a broad range of distortion types, increasing data diversity, and incorporating extensive evaluation metrics. This work introduces the Interspeech 2025 URGENT Challenge, the second edition of the series, to explore several aspects that have received limited attention so far: language dependency, universality for more distortion types, data scalability, and the effectiveness of using noisy training data. We received 32 submissions, where the best system uses a discriminative model, while most other competitive ones are hybrid methods. Analysis reveals some key findings: (i) some generative or hybrid approaches are preferred in subjective evaluations over the top discriminative model, and (ii) purely generative SE models can exhibit language dependency.
Universal speech enhancement aims to handle input speech with different distortions and input formats. To tackle this challenge, we present TS-URGENet, a Three-Stage Universal, Robust, and Generalizable speech Enhancement Network. To address various distortions, the proposed system employs a novel three-stage architecture consisting of a filling stage, a separation stage, and a restoration stage. The filling stage mitigates packet loss by preliminarily filling lost regions under noise interference, ensuring signal continuity. The separation stage suppresses noise, reverberation, and clipping distortion to improve speech clarity. Finally, the restoration stage compensates for bandwidth limitation, codec artifacts, and residual packet loss distortion, refining the overall speech quality. Our proposed TS-URGENet achieves outstanding performance in the Interspeech 2025 URGENT Challenge, ranking 2nd in Track 1.
During audio transmission and processing, various distortions may occur. To address this challenge effectively, we developed a multistage universal speech enhancement system consisting of four submodules: audio declipping, packet loss compensation, audio separation, and spectral inpainting. These modules operate across the time, sub-band, and time-frequency domains. We employed a pretrain-finetune training paradigm and introduced a self-distillation method to further improve performance. Experiments on large-scale datasets demonstrate that our system outperforms competing systems on multiple evaluation metrics, particularly in improving subjective speech quality. The proposed system ranked 1st in the URGENT 2024 Challenge with a MOS of 3.52 and placed 4th in the second track of the URGENT 2025 Challenge.