In the realm of audio-language pre-training (ALP), the challenge of achieving cross-modal alignment is significant. Moreover, the integration of audio inputs with diverse distributions and task variations poses challenges in developing generic audio-language models. In this study, we present MINT, a novel ALP framework boosting audio-language models through multi-target pre-training and instruction tuning. MINT leverages the strength of frozen pre-trained audio encoders and large language models (LLMs) to improve audio-language pre-training, enabling effective transferability to both audio-text understanding and generation tasks. To address the modality gap, we introduce Bridge-Net, a trainable module that enhances cross-modality alignment and the model’s ability to follow instructions for a variety of audio-text tasks. Bridge-Net is pivotal within MINT, initially enhancing audio-language representation learning through a multi-target pre-training approach. Subsequently, Bridge-Net further boosts audio-to-language generative learning by integrating a frozen language model with instruction tuning. This integration empowers MINT to extract features in a flexible and effective manner, specifically tailored to the provided instructions for diverse tasks. Experimental results demonstrate that MINT attains superior performance across various audio-language understanding and generation tasks, highlighting its robust generalization capabilities even in zero-shot scenarios.
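The abstract does not specify Bridge-Net's internals; as a hedged illustration only, a minimal sketch of a trainable bridge between a frozen audio encoder and a frozen LLM might look as follows, assuming a small query-based transformer and a linear projection into the LLM embedding space (all module names, sizes, and the query mechanism are assumptions, not the authors' design):

```python
import torch
import torch.nn as nn

class BridgeNet(nn.Module):
    """Hypothetical sketch: a small trainable module that maps frozen
    audio-encoder features into the embedding space of a frozen LLM."""
    def __init__(self, audio_dim=768, llm_dim=4096, n_query=32, n_layers=2):
        super().__init__()
        # Learnable query tokens that attend over the frozen audio features.
        self.queries = nn.Parameter(torch.randn(n_query, audio_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=audio_dim, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.to_llm = nn.Linear(audio_dim, llm_dim)  # project into the LLM space

    def forward(self, audio_feats):          # audio_feats: (B, T, audio_dim), frozen encoder output
        B = audio_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        fused = self.decoder(tgt=q, memory=audio_feats)
        return self.to_llm(fused)            # (B, n_query, llm_dim), prepended to the LLM input
```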
Contrastive language-audio pre-training (CLAP) enables zero-shot (ZS) inference of audio and exhibits promising performance in several classification tasks. However, conventional audio representations are still crucial for many tasks where ZS is not applicable (e.g., regression problems). Here, we explore a new representation, a general-purpose audio-language representation, that performs well in both ZS and transfer learning. To do so, we propose a new method, M2D-CLAP, which combines self-supervised learning Masked Modeling Duo (M2D) and CLAP. M2D learns an effective representation to model audio signals, and CLAP aligns the representation with text embedding. As a result, M2D-CLAP learns a versatile representation that allows for both ZS and transfer learning. Experiments show that M2D-CLAP performs well on linear evaluation, fine-tuning, and ZS classification with a GTZAN state-of-the-art of 75.17%, thus achieving a general-purpose audio-language representation.
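M2D-CLAP's exact objective is not given in the abstract; the CLAP-style alignment it builds on is typically a symmetric contrastive (InfoNCE) loss over paired audio and text embeddings. A minimal sketch, assuming pre-computed embeddings and a fixed temperature:

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings."""
    a = F.normalize(audio_emb, dim=-1)        # (B, D)
    t = F.normalize(text_emb, dim=-1)         # (B, D)
    logits = a @ t.T / temperature            # scaled cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```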
This paper proposes an audio fingerprinting model with holographic reduced representation (HRR). The proposed method reduces the number of stored fingerprints, whereas conventional neural audio fingerprinting requires many fingerprints for each audio track to achieve high accuracy and time resolution. We utilize HRR to aggregate multiple fingerprints into a composite fingerprint via circular convolution and summation, resulting in fewer fingerprints of the same dimensionality as the originals. Our search method efficiently finds the composite fingerprint in which a query fingerprint exists. Using HRR's inverse operation, it can recover the query's relative position within the composite fingerprint, retaining the original time resolution. Experiments show that our method can reduce the number of fingerprints with modest accuracy degradation while maintaining the time resolution, outperforming simple decimation and summation-based aggregation methods.
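The core HRR operations described here, binding a fingerprint to a position key via circular convolution, summing the bound vectors into one composite, and recovering a position with the HRR inverse (circular correlation), can be sketched as follows; the key generation and dot-product matching shown are illustrative assumptions, not the paper's exact search method:

```python
import numpy as np

def bind(x, key):
    """Circular convolution (HRR binding) via FFT."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(key)))

def unbind(c, key):
    """Approximate HRR inverse: circular correlation with the key."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.conj(np.fft.fft(key))))

rng = np.random.default_rng(0)
d = 128
keys = rng.normal(0, 1 / np.sqrt(d), size=(4, d))        # one position key per slot (assumed)
fps = rng.normal(0, 1 / np.sqrt(d), size=(4, d))          # four toy fingerprints
fps /= np.linalg.norm(fps, axis=1, keepdims=True)

composite = sum(bind(f, k) for f, k in zip(fps, keys))    # a single stored vector

# Recover which slot a query occupies: unbind with each key and compare.
query = fps[2]
scores = [np.dot(unbind(composite, k), query) for k in keys]
print(np.argmax(scores))    # expected: 2, the query's relative position in the composite
```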
In the film industry, audio-video synchronization issues are considered major quality defects and key drivers of viewer disengagement. This is especially true for dubbed content, which is more prone to these errors due to the added manual process of replacing the original speech with a translated version. Despite their potential benefit for dubbed media production, automatic sync detection methods are seldom explored. In this paper, we propose a Transformer-based Siamese network for dubbed audio synchronization detection. Based on a large dataset of dubbed entertainment, we demonstrate that, compared to previous methods, our approach is more robust in detecting the misalignment introduced by translated speech segments. While our method addresses the previously studied constant synchronization errors, our model is the first to handle the frequent issue of intermittent offsets.
Pitch estimation is of fundamental importance in audio processing and music information retrieval. YOLO is a well-developed model designed for image target detection. Here we introduce YOLOv7 to the pitch estimation task and improve it by adding a time-frequency (TF) dual branch to the model, motivated by human auditory pitch perception. An additional advantage of the model over state-of-the-art (SOTA) models is that it only needs to add an unvoiced class, rather than a separate voiced/unvoiced detector, to achieve joint pitch estimation and voicing determination. Experiments show that, for both music and speech, the proposed TF dual branch boosts pitch estimation accuracy over the backbone. Our model exhibits superior pitch estimation performance over the SOTA and shows minimal performance degradation in noisy conditions. The overall accuracy on the MDB-stem-synth dataset peaks at 99.4%, and the voicing determination F-score reaches 99.9%.
Self-supervised representation learning (SSRL) has demonstrated superior performance to supervised models for tasks including phoneme recognition. Training SSRL models poses a challenge for low-resource languages where sufficient pre-training data may not be available. A common approach is cross-lingual pre-training. Instead, we propose to use audio augmentation techniques, namely pitch variation, noise addition, accented target-language speech, and other-language speech, to pre-train SSRL models in a low-resource condition and evaluate phoneme recognition. Our comparisons found that a combined synthetic augmentation (noise/pitch) strategy outperformed accent and language knowledge transfer. Furthermore, we examined the scaling factor of augmented data required to achieve performance equivalent to a model pre-trained with target-domain speech. Our findings suggest that, for resource-constrained languages, combined augmentations can be a more viable option than the other augmentations considered.
We develop two complementary advances for training no-reference (NR) speech quality estimators with independent datasets. Multi-dataset finetuning (MDF) pretrains an NR estimator on a single dataset and then finetunes it on multiple datasets at once, including the dataset used for pretraining. AlignNet uses an AudioNet to generate intermediate score estimates before using the Aligner to map intermediate estimates to the appropriate score range. AlignNet is agnostic to the choice of AudioNet so any successful NR speech quality estimator can benefit from its Aligner. The methods can be used in tandem, and we use two studies to show that they improve on current solutions: one study uses nine smaller datasets and the other uses four larger datasets. AlignNet with MDF improves on other solutions because it efficiently and effectively removes misalignments that impair the learning process, and thus enables successful training with larger amounts of more diverse data.
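A hedged sketch of the AlignNet idea as described, with an arbitrary AudioNet producing an intermediate score that a small per-dataset Aligner maps onto that dataset's scale; the Aligner architecture and integer dataset indexing are assumptions:

```python
import torch
import torch.nn as nn

class AlignNet(nn.Module):
    """Sketch: any NR quality estimator (AudioNet) plus small per-dataset
    Aligners that map its intermediate score to each dataset's score range."""
    def __init__(self, audio_net, num_datasets, hidden=16):
        super().__init__()
        self.audio_net = audio_net                       # e.g., any existing NR estimator
        self.aligners = nn.ModuleList(
            nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(num_datasets)
        )

    def forward(self, audio, dataset_id: int):
        s = self.audio_net(audio)            # intermediate score estimates, shape (B, 1)
        return self.aligners[dataset_id](s)  # mapped to the given dataset's score range
```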
Mispronunciation Detection and Diagnosis (MDD) systems, leveraging Automatic Speech Recognition (ASR), face two main challenges in Mandarin Chinese: 1) The two-stage models create an information gap between the phoneme or tone classification stage and the MDD stage. 2) The scarcity of Mandarin MDD datasets limits model training. In this paper, we introduce a stateless RNN-T model for Mandarin MDD, utilizing HuBERT features with pitch embedding through a Pitch Fusion Block. Our model, trained solely on native speaker data, shows a 3% improvement in Phone Error Rate and a 7% increase in False Acceptance Rate over the state-of-the-art baseline in non-native scenarios.
Pronunciation assessment models designed for open-response scenarios enable users to practice language skills in a manner similar to real-life communication. However, previous open-response pronunciation assessment models have predominantly focused on a single pronunciation task, such as sentence-level accuracy, rather than offering a comprehensive assessment across various aspects. We propose MultiPA, a Multitask Pronunciation Assessment model that provides sentence-level accuracy, fluency, prosody, and word-level accuracy assessment for open responses. We examined the correlation between different pronunciation tasks and showed the benefits of multi-task learning. Our model reached state-of-the-art performance on existing in-domain datasets and effectively generalized to an out-of-domain dataset that we newly collected. The experimental results demonstrate the practical utility of our model in real-world applications.
Traditional phoneme-level goodness of pronunciation (GOP) methods require phoneme-to-speech alignment. The drawback is that these methods, by their definitions, are prone to alignment errors and preclude the possibility of deletion and insertion errors in pronunciation. We produce experimental evidence that CTC-based methods can be used in traditional GOP estimation in spite of their “peaky” output behaviour and may be less prone to alignment errors than traditional methods. We also propose a new framework for GOP estimation based on a CTC-trained model that is independent of speech-phoneme alignment. By accounting for deletion and insertion as well as substitution errors, we show that our framework outperforms alignment-based methods. Our experimental results are based on the CMU-kids dataset for child speech and on the Speechocean762 dataset, which contains both child and adult speakers. Our best method achieves a 29.02% relative improvement over the baseline GOP methods.
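One alignment-free way to score a canonical phoneme sequence with a CTC-trained model is to use its length-normalised CTC log-likelihood; this is only a plausible sketch of such a GOP measure, not necessarily the authors' formulation:

```python
import torch
import torch.nn.functional as F

def ctc_gop(log_probs, canonical, blank=0):
    """Alignment-free goodness-of-pronunciation sketch: score the canonical
    phoneme sequence by its length-normalised CTC log-likelihood.

    log_probs: (T, 1, C) log-softmax outputs of a CTC-trained model
    canonical: 1-D tensor of canonical phoneme ids (no blanks)
    """
    input_lengths = torch.tensor([log_probs.size(0)])
    target_lengths = torch.tensor([canonical.numel()])
    nll = F.ctc_loss(log_probs, canonical.unsqueeze(0),
                     input_lengths, target_lengths,
                     blank=blank, reduction="sum")
    return -nll.item() / canonical.numel()   # higher = closer to the canonical pronunciation
```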
The automatic identification and analysis of pronunciation errors, known as mispronunciation detection and diagnosis (MDD), is vital in computer-aided pronunciation learning (CAPL) tools for second-language (L2) learning. Existing MDD methods focus on analyzing phonemes, but they can only detect categorical errors for phonemes with sufficient training data. Due to the unpredictable nature of non-native speakers’ pronunciation errors and limited training datasets, modelling all mispronunciations becomes impractical. Additionally, phoneme-level MDD approaches provide limited diagnostic information. In our proposed approach, we detect phonological features, breaking down phoneme production into elementary components related to the articulatory system, offering more informative feedback to learners. Applied to L2 English speech data, it outperformed traditional phoneme-level methods, reducing false acceptance rate (FAR), false rejection rate (FRR), and diagnostic error rate (DER).
In automated pronunciation assessment, recent work increasingly emphasizes evaluating multiple aspects to provide enriched feedback. However, acquiring multi-aspect-score labeled data for non-native language learners' speech poses challenges; moreover, it often leads to score-imbalanced distributions. In this paper, we propose two Acoustic Feature Mixup strategies, linearly and non-linearly interpolating with the in-batch averaged feature, to address data scarcity and score-label imbalances. Primarily using goodness-of-pronunciation as an acoustic feature, we tailor mixup designs to suit pronunciation assessment. Further, we integrate fine-grained error-rate features by comparing speech recognition results with the original answer phonemes, giving direct hints for mispronunciation. Effective mixing of the acoustic features notably enhances overall scoring performance on the speechocean762 dataset, and detailed analysis highlights the potential of our approach to predict unseen distortions.
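A minimal sketch of the linear variant, interpolating each sample's acoustic feature (e.g., a GOP vector) toward the in-batch average; the Beta-sampled mixing coefficient and applying the same interpolation to the score labels are assumptions:

```python
import torch

def in_batch_average_mixup(feats, scores, alpha=0.2):
    """Linear acoustic-feature mixup sketch: interpolate each sample's
    feature and label toward the in-batch average."""
    lam = torch.distributions.Beta(alpha, alpha).sample()   # mixing coefficient (assumed Beta)
    feat_mean = feats.mean(dim=0, keepdim=True)              # in-batch averaged feature
    score_mean = scores.mean(dim=0, keepdim=True)
    mixed_feats = lam * feats + (1 - lam) * feat_mean
    mixed_scores = lam * scores + (1 - lam) * score_mean     # labels mixed the same way (assumption)
    return mixed_feats, mixed_scores
```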
We propose a framework to address several unsolved challenges in second language (L2) automatic speaking assessment (ASA) and feedback. The challenges include: 1. ASA of visual task completion, 2. automated content grading and explanation of spontaneous L2 speech, 3. corrective feedback generation for L2 learners, and 4. all the above for a language that has minimal speech data of L2 learners. The proposed solution combines visual natural language generation (NLG), automatic speech recognition (ASR) and prompting a large language model (LLM) for low-resource L2 learners. We describe the solution and the outcomes of our case study for a picture description task in Finnish. Our results indicate substantial agreement with human experts in grading, explanation and feedback. This framework has the potential for a significant impact in constructing next-generation computer-assisted language learning systems to provide automatic scoring with feedback for learners of low-resource languages.
Based on a medium-sized sample of English investor-oriented business-idea presentations (so-called “investor pitches”), the present paper investigates the links between speech rhythm and perceived speaker charisma. Eight trained public speakers were recorded while performing the same investor pitch twice, once in an emotionally neutral, matter-of-fact fashion and once charismatically, i.e. in an expressive, committed onstage presentation style. The recorded presentations were rated by 21 listeners for their degree of perceived speaker charisma and additionally acoustically analyzed in terms of established duration-based rhythm measures such as ∅, Δ, and PVI. We find significant rhythmic differences between the matter-of-fact and charismatic presentation performances and, in conjunction with the perception results, we show that consonantal rhythmic elements play a bigger role in the perception than in the production of rhythmic charisma, and that especially the duration variation of larger rhythm elements correlates positively and gender-independently with charisma ratings. The findings are discussed in light of previous studies, along with their practical implications.
This study investigates the acoustic features of sarcasm and disentangles the interplay between the propensity of an utterance being used sarcastically and the presence of prosodic cues signaling sarcasm. Using a dataset of sarcastic utterances compiled from television shows, we analyze the prosodic features within utterances and key phrases belonging to three distinct sarcasm categories (embedded, propositional, and illocutionary), which vary in the degree of semantic cues present, and compare them to neutral expressions. Results show that in phrases where the sarcastic meaning is salient from the semantics, the prosodic cues are less relevant than when the sarcastic meaning is not evident from the semantics, suggesting a trade-off between prosodic and semantic cues of sarcasm at the phrase level. These findings highlight a lessened reliance on prosodic modulation in semantically dense sarcastic expressions and a nuanced interaction that shapes the communication of sarcastic intent.
Recognizing a speaker's level of commitment to a belief is a difficult task; humans not only interpret the meaning of the words in context, but also understand cues from intonation and other aspects of the audio signal. Many papers and corpora in the NLP community have approached the belief prediction task using text-only approaches. We are the first to frame and present results on the multimodal belief prediction task. We use the CB-Prosody corpus (CBP), containing aligned text and audio with speaker belief annotations. We first report baselines and significant features using acoustic-prosodic features and traditional machine learning methods. We then present text and audio baselines for the CBP corpus, fine-tuning BERT and Whisper respectively. Finally, we present our multimodal architecture, which fine-tunes BERT and Whisper and uses multiple fusion methods, improving on both modalities alone.
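As one plausible instance of the fusion methods mentioned (not necessarily the authors' best-performing one), a late-fusion head over a pooled BERT text embedding and a pooled Whisper-encoder audio embedding could look like this; the dimensions and the output head are assumptions:

```python
import torch
import torch.nn as nn

class LateFusionBeliefModel(nn.Module):
    """Sketch of one fusion option: concatenate a pooled BERT text vector and
    a pooled Whisper-encoder audio vector, then predict the belief rating."""
    def __init__(self, text_dim=768, audio_dim=512, out_dim=1):
        super().__init__()
        # out_dim=1 treats belief as a regression target; use class logits otherwise.
        self.head = nn.Sequential(
            nn.Linear(text_dim + audio_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, out_dim),
        )

    def forward(self, text_emb, audio_emb):
        # text_emb: pooled BERT [CLS] embedding; audio_emb: mean-pooled Whisper encoder states
        return self.head(torch.cat([text_emb, audio_emb], dim=-1))
```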
Empathy is the ability to understand another’s feelings as if we were having those feelings ourselves. It has been shown to increase people’s trust and likability. Much research has been done on creating empathetic responses in text in conversational systems, yet little work has been done to identify the acoustic-prosodic speech features that can create an empathetic-sounding voice. Our contributions include 1) the collection of a new empathy speech dataset, 2) the identification of interpretable acoustic-prosodic features that contribute to empathy expression, and 3) a benchmark for the empathy detection task.
Counseling is a conversational activity between a therapist and a client. Therapist empathy is an essential indicator of counseling quality and is assessed subjectively by considering the entire conversation. This paper proposes to encode long counseling conversations using a hierarchical attention network. Conversations with extreme values of empathy rating are used to train a Siamese-network-based encoder with a contrastive loss. Two-level attention mechanisms are applied to learn the importance weights of individual speaker turns and groups of turns in the conversation. Experimental results show that the use of the contrastive loss is effective in encouraging the conversation encoder to learn discriminative embeddings that are related to therapist empathy. The distances between conversation embeddings positively correlate with the differences in the respective empathy scores. The learned conversation embeddings can be used to predict the subjective rating of therapist empathy.
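A hedged sketch of the pairwise contrastive objective such a Siamese conversation encoder could use, pulling together conversations with matching extreme empathy labels and pushing apart high/low pairs; the margin and pairing scheme are assumptions:

```python
import torch
import torch.nn.functional as F

def empathy_contrastive_loss(emb_a, emb_b, same_label, margin=1.0):
    """Classic contrastive loss on pairs of conversation embeddings from a
    shared (Siamese) encoder. same_label is a float tensor of 1s (both high
    or both low empathy) and 0s (one high, one low)."""
    d = F.pairwise_distance(emb_a, emb_b)                 # Euclidean distance per pair
    pos = same_label * d.pow(2)                           # pull matching pairs together
    neg = (1 - same_label) * torch.clamp(margin - d, min=0).pow(2)  # push mismatched apart
    return 0.5 * (pos + neg).mean()
```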
Language ability in old age is a balance between preservation and decline. Modelling baseline language variation in normal aging is thus important for our understanding of healthy aging and can help detect cognitive impairments at the prodromal stage. Large language databases and NLP tools enable us to conduct automated quantitative analysis of natural language data. In this study, we aim to demonstrate that (i) age and sex influence older adults’ lexical distribution and lexical concreteness; and (ii) using NLP tools and psycholinguistic metrics to process natural language datasets can help to set a normative benchmark for language in aging.
In emergency medicine, timely intervention for patients at risk of suicide is often hindered by delayed access to specialised psychiatric care. To bridge this gap, we introduce a speech-based approach for automatic suicide risk assessment. Our study involves a novel dataset comprising speech recordings of 20 patients who read neutral texts. We extract four speech representations encompassing interpretable and deep features. Further, we explore the impact of gender-based modelling and phrase-level normalisation. By applying gender-exclusive modelling, features extracted from an emotion fine-tuned wav2vec2.0 model can be utilised to discriminate high from low suicide risk with a balanced accuracy of 81%. Finally, our analysis reveals a discrepancy in the relationship between speech characteristics and suicide risk for female and male subjects. For men in our dataset, suicide risk increases together with agitation, while the voice characteristics of female subjects point in the opposite direction.
The paper introduces a new device for analyzing, teaching, and training jaw movements: the MARRYS helmet. We outline the motivation for the development of the helmet, describe its key advantages and features relative to those of the Electromagnetic Articulograph (EMA) and illustrate by means of selected study portraits the possible uses of the MARRYS helmet in the various fields of the empirical and applied speech sciences.
This work is concerned with devising a robust Parkinson's disease (PD) detector from speech in real-world operating conditions using (i) foundational models, and (ii) speech enhancement (SE) methods. To this end, we first fine-tune several foundational-based models on the standard PC-GITA (s-PC-GITA) clean data. Our results demonstrate superior performance to previously proposed models. Second, we assess the generalization capability of the PD models on the extended PC-GITA (e-PC-GITA) recordings, collected in real-world operative conditions, and observe a severe drop in performance moving from ideal to real-world conditions. Third, we align training and testing conditions by applying off-the-shelf SE techniques to e-PC-GITA, and a significant boost in performance is observed only for the foundational-based models. Finally, combining the two best foundational-based models trained on s-PC-GITA, namely WavLM Base and HuBERT Base, yielded top performance on the enhanced e-PC-GITA.
Chronic obstructive pulmonary disease (COPD) is a serious inflammatory lung disease affecting millions of people around the world. Due to an obstructed airflow from the lungs, it also becomes manifest in patients' vocal behaviour. Of particular importance is the detection of an exacerbation episode, which marks an acute phase and often requires hospitalisation and treatment. Previous work has shown that it is possible to distinguish between a pre- and a post-treatment state using automatic analysis of read speech. In this contribution, we examine whether sustained vowels can provide a complementary lens for telling apart these two states. Using a cohort of 50 patients, we show that the inclusion of sustained vowels can improve performance up to 79% unweighted average recall, from a 71% baseline using read speech. We further identify and interpret the most important acoustic features that characterise the manifestation of COPD in sustained vowels.
Automatic pathological speech detection relies on deep learning (DL), showing promising performance for various pathologies. Despite the critical importance of robustness in healthcare applications like pathological speech detection, the sensitivity of DL-based pathological speech detection approaches to adversarial attacks remains unexplored. This paper explores the impact of acoustically imperceptible adversarial perturbations on DL-based pathological speech detection. The imperceptibility of perturbations, generated using the projected gradient descent algorithm, is evaluated using speech enhancement metrics. Results reveal a high vulnerability of DL-based pathological speech detection to adversarial perturbations, with adversarial training ineffective in enhancing robustness. Analysis of the perturbations provides insights into the speech components that the approaches attend to. These findings highlight the need for research in robust pathological speech detection.
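A minimal sketch of the projected gradient descent attack in the L-infinity setting, applied to a waveform-level classifier; the perturbation budget, step size, and number of steps are assumptions rather than the paper's settings:

```python
import torch

def pgd_attack(model, x, y, eps=0.001, alpha=0.0002, steps=10):
    """Projected gradient descent sketch: craft a small waveform perturbation
    (L-inf ball of radius eps) that pushes a pathological-speech classifier
    away from the true label y."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = torch.nn.functional.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()              # ascend the loss
            x_adv = x + torch.clamp(x_adv - x, -eps, eps)    # project back into the eps-ball
    return x_adv.detach()
```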
Addressing speech sound disorders (SSD) in early childhood is pivotal for mitigating cognitive and communicative impediments. Previous works on automatic SSD detection rely on audio features without considering age and speaker bias, which results in degraded performance. In this paper, we propose an SSD detection system in which debiasing techniques are applied to mitigate these biases. For the age bias, we use a multi-head model where the feature extractor is shared across different age groups but the final decision is made using an age-dependent classifier. For the speaker bias, we augment the dataset by mixing the audio of multiple speakers in the same age group. When evaluated on our Korean SSD dataset, the proposed method showed significant improvements over previous approaches.
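Hedged sketches of the two debiasing ideas, a shared extractor with age-group-specific heads and a same-age speaker mixing augmentation; layer sizes, the pooling step, and the mixing weight are assumptions:

```python
import torch
import torch.nn as nn

class AgeMultiHeadDetector(nn.Module):
    """Sketch: shared feature extractor, one SSD-classification head per age group."""
    def __init__(self, extractor, feat_dim=768, num_age_groups=4, num_classes=2):
        super().__init__()
        self.extractor = extractor                      # shared across age groups
        self.heads = nn.ModuleList(nn.Linear(feat_dim, num_classes)
                                   for _ in range(num_age_groups))

    def forward(self, audio, age_group: int):
        # Assumes the extractor returns frame-level features of shape (B, T, feat_dim).
        feats = self.extractor(audio).mean(dim=1)       # time-pooled utterance embedding
        return self.heads[age_group](feats)             # age-dependent decision

def mix_same_age_speakers(wav_a, wav_b, weight=0.5):
    """Speaker-debiasing augmentation sketch: mix the waveforms of two speakers
    from the same age group (the mixing weight is an assumed hyperparameter)."""
    n = min(wav_a.size(-1), wav_b.size(-1))
    return weight * wav_a[..., :n] + (1 - weight) * wav_b[..., :n]
```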