This study compares probabilistic predictors based on information theory with Naive Discriminative Learning (NDL) predictors in modeling acoustic word duration, focusing on probabilistic reduction. We examine three models using the Buckeye corpus: one with NDL-derived predictors using information-theoretic formulas, one with traditional NDL predictors, and one with N-gram probabilistic predictors. Results show that the N-gram model outperforms both NDL models, challenging the assumption that NDL is more effective due to its cognitive motivation. However, incorporating information-theoretic formulas into NDL improves model performance over the traditional model. This research highlights a) the need to incorporate not only frequency and contextual predictability but also average contextual predictability, and b) the importance of combining information-theoretic metrics of predictability and information derived from discriminative learning in modeling acoustic reduction.
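For reference, the information-theoretic quantities alluded to above are typically formalized as below; this is a generic sketch of frequency-based information, contextual predictability, and average contextual predictability (word informativity), and the paper's exact formulas may differ.

```latex
% Generic information-theoretic predictors (illustrative; the paper's exact
% formulas may differ). w_i is the target word, c ranges over its contexts.
\begin{align}
  I(w_i)              &= -\log_2 P(w_i)                                && \text{frequency-based information}\\
  I(w_i \mid w_{i-1}) &= -\log_2 P(w_i \mid w_{i-1})                   && \text{contextual predictability (surprisal)}\\
  \bar{I}(w_i)        &= -\sum_{c} P(c \mid w_i)\,\log_2 P(w_i \mid c) && \text{average contextual predictability}
\end{align}
```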
This paper examines how contextual predictability (surprisal) influences acoustic features in Polish speech using the PRODIS dataset. The study analyzes connected read speech from Wikipedia texts on history, politics, culture, and science, extracting surprisal values from a phoneme-based language model. Results show that high surprisal increases acoustic distinctiveness, such as longer segment duration and larger vowel space, while low surprisal reduces distinctiveness. We also find effects of text topic, lexical frequency, and lexical stress on surprisal. These findings highlight the complex interplay between predictability, discourse, and prosody in speech. The work contributes to the understudied area of predictability and acoustics in Slavic languages.
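As an illustration of the surprisal extraction step, the sketch below trains a toy add-one-smoothed bigram phoneme model and reports per-phoneme surprisal in bits; it is a minimal stand-in for the phoneme-based language model used in the study, not the PRODIS pipeline itself.

```python
import math
from collections import Counter

def train_bigram_lm(phoneme_sequences):
    """Add-one-smoothed bigram model over phonemes (a toy stand-in for the
    phoneme-based LM mentioned above)."""
    unigrams, bigrams = Counter(), Counter()
    for seq in phoneme_sequences:
        padded = ["<s>"] + seq
        unigrams.update(padded)
        bigrams.update(zip(padded[:-1], padded[1:]))
    vocab = set(unigrams)

    def surprisal(prev, cur):
        # -log2 P(cur | prev), in bits
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(vocab))
        return -math.log2(p)

    return surprisal

# Toy usage: per-phoneme surprisal for one "utterance"
corpus = [list("kotek"), list("kotka"), list("kotu")]
surprisal = train_bigram_lm(corpus)
for prev, cur in zip(["<s>"] + list("kotka"), list("kotka")):
    print(cur, round(surprisal(prev, cur), 2))
```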
Wheeldon & Lahiri reported that speech initiation time (RT) was influenced by the phonological complexity of the initial prosodic word (PWd) in an immediate-production task, but by the number of PWds in a delayed-production task that enabled preplanning. Thus, different tasks are sensitive to different types of complexity. The current study explored RT for different complexity types: number of syllables in the PWd (monosyllabic vs disyllabic), location of a complex PWd (early vs late), and number of PWds (3 vs. 4 vs. 5) using a reading aloud task. The presence of a disyllabic noun increased RT only in 4-PWd sentences, where the syntactic complexity of the SUBJECT and OBJECT NPs differed. This suggests that RT may increase in the absence of parallel syntactic structures in SUBJECT and OBJECT position, highlighting the need for further research on the role of complexity at different levels in the speech planning process.
While extensive studies have explored acoustic focus realization in Mandarin, little is known about how focus affects the prosodic phrasing of Mandarin complex nominals. This study examined how contrastive focus influences syllable duration of Mandarin numeral-classifier-noun phrases. Using a mini-dialogue paradigm, we elicited contrastive focus of different spans, alongside a baseline no-focus condition. Two production experiments revealed that when focus was placed on the numeral, the default prosodic grouping was disrupted, unlike when focus encompassed the entire phrase or in neutral contexts, with tonal factors amplifying the reorganization. Our results indicate that prosodic organization in Mandarin is shaped by tones, morphosyntactic structures, focus marking and their interplay. Crucially, the results challenge rigid models of boundary phrasing and disyllabic footing, highlighting a multilevel interaction among phonetics, prosodic phrasing, and syntax in tonal languages.
In emotion recognition from speech, a key challenge lies in identifying speech signal segments that carry the most relevant acoustic variations for discerning specific emotions. Traditional approaches compute functionals for features such as energy and F0 over entire sentences or longer speech portions, potentially missing essential fine-grained variation in the long-form statistics. This research investigates the use of word informativeness, derived from a pre-trained language model, to identify semantically important segments. Acoustic features are then computed exclusively for these identified segments, enhancing emotion recognition accuracy. The methodology utilizes standard acoustic prosodic features, their functionals, and self-supervised representations. Results indicate a notable improvement in recognition performance when features are computed on segments selected based on word informativeness, underscoring the effectiveness of this approach.
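A schematic of the segment-selection step described above, assuming word-level time alignments and precomputed LM-based informativeness scores; the threshold, alignment format, and feature set are illustrative, not the paper's exact configuration.

```python
import numpy as np

def select_informative_spans(words, informativeness, threshold=0.6):
    """Keep the time spans of words whose informativeness exceeds a threshold.
    `words` is a list of (word, start_s, end_s); `informativeness` is a parallel
    list of LM-derived scores (how they are computed is model-specific)."""
    return [(s, e) for (w, s, e), score in zip(words, informativeness)
            if score >= threshold]

def pooled_features(frame_feats, frame_times, spans):
    """Mean/std functionals of frame-level features restricted to the selected
    spans (a stand-in for the functionals used in the paper)."""
    mask = np.zeros(len(frame_times), dtype=bool)
    for s, e in spans:
        mask |= (frame_times >= s) & (frame_times <= e)
    selected = frame_feats[mask]
    return np.concatenate([selected.mean(axis=0), selected.std(axis=0)])

# Toy usage with made-up alignments, scores, and frame-level features
words = [("I", 0.0, 0.1), ("really", 0.1, 0.5), ("hate", 0.5, 0.9), ("this", 0.9, 1.1)]
scores = [0.1, 0.7, 0.9, 0.2]
times = np.linspace(0.0, 1.1, 111)
feats = np.random.randn(111, 3)          # e.g. energy, F0, HNR per frame
print(pooled_features(feats, times, select_informative_spans(words, scores)).shape)
```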
The realization of palato-alveolar affricates in Italian varies regionally. While affricates have undergone deaffrication in Tuscan and southern varieties (/ˈt͡ʃeː.na/ → [ˈʃeː.na]), northern varieties are traditionally described as retaining the affricate. We provide acoustic and articulatory (EMA) evidence of an incipient lenition process in non-deaffricating varieties. This process is stress-conditioned as it is blocked in post-tonic affricates. Five of fifteen speakers deaffricated far-from-stress affricates in nonce words, as indicated by higher RMS amplitude and energy during the closure, while post-tonic affricates were preserved. Ten speakers did not exhibit stress-conditioned deaffrication. Both groups showed a longer acoustic closure duration, caused by delayed articulatory target achievement in post-tonic position. We discuss the phonological implications of these findings and their potential role in sound change.
This study evaluated a simulation model for sound radiation from the vocal tract wall in the context of articulatory speech synthesis. In this model, the vocal tract is represented in terms of incremental contiguous tube sections, where the wall of each section reacts to the local sound pressure like a damped spring-mass system. The surface-radiated sound from this model was compared with that of six real speakers by means of the voicebars of /b,d,g/ in different vowel contexts. Based on this, the wall parameters of the real speakers were estimated by fitting the simulated to the real voicebars. The parameter optimization allowed a close reproduction of the natural voicebar spectra with root-mean-square errors between 2.26 dB and 3.82 dB in the frequency range from 0 to 800 Hz. Hence, despite its simplicity, the modeling method is very well suited for articulatory speech synthesis.
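For reference, a wall section reacting to the local sound pressure like a damped spring-mass system corresponds to a second-order equation of the following generic form; the symbols and per-section parameterization are illustrative, not necessarily the paper's.

```latex
% Generic damped spring-mass wall model for one tube section
% (symbols are illustrative; the paper's parameterization may differ).
% x(t): outward wall displacement, p(t): local sound pressure,
% A: wall surface area of the section, m, b, k: mass, damping, stiffness.
\begin{equation}
  m\,\ddot{x}(t) + b\,\dot{x}(t) + k\,x(t) = A\,p(t)
\end{equation}
```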
We introduce the new FROST-EMA (Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography) corpus. It comprises recordings from 18 bilingual speakers, who produced speech in their native language (L1), their second language (L2), and imitated L2 (a fake foreign accent). The new corpus enables research into language variability from phonetic and technological points of view. Accordingly, we include two preliminary case studies to demonstrate both perspectives. The first case study explores the impact of L2 and imitated L2 on the performance of an automatic speaker verification system, while the second illustrates the articulatory patterns of one speaker in L1, L2, and a fake accent.
Current methods of automated speech-based cognitive assessment often rely on fixed-picture descriptions in major languages, limiting repeatability, engagement, and locality. This paper introduces HK-GenSpeech (HKGS), a framework using generative AI to create pictures that present similar features to those used in cognitive assessment, augmented with descriptors reflecting the local context. We demonstrate HKGS through a dataset of 423 Cantonese speech samples collected in Hong Kong from 141 participants, with HK-MoCA scores ranging from 11 to 30. Each participant described the cookie theft picture, an HKGS fixed image, and an HKGS dynamic image. Regression experiments show comparable accuracy for all image types, indicating HKGS' adequacy in generating relevant assessment images. Lexical analysis further suggests that HKGS images elicit richer speech. By mitigating learning effects and improving engagement, HKGS supports broader data collection, particularly in low-resource settings.
This study explores the potential of Rhythm Formant Analysis (RFA) to capture long-term temporal modulations in dementia speech. Specifically, we introduce RFA-derived rhythm spectrograms as novel features for dementia classification and regression tasks. We propose two methodologies: (1) handcrafted features derived from rhythm spectrograms, and (2) a data-driven fusion approach that integrates the proposed RFA-derived rhythm spectrograms, processed by a vision transformer (ViT) for acoustic representations, with BERT-based linguistic embeddings. We compare these with existing features. Notably, our handcrafted features outperform eGeMAPS with a relative improvement of 14.2% in classification accuracy and comparable performance in the regression task. The fusion approach also shows improvement, with RFA spectrograms surpassing Mel spectrograms in classification by a relative improvement of around 13.1%, with regression scores comparable to the baselines. All code is available in a GitHub repository.
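To make the RFA-derived representation concrete, the sketch below computes a low-frequency spectrogram of the amplitude envelope, which is the general idea behind rhythm spectrograms; the window length, hop, and 10 Hz rhythm band are illustrative choices rather than the paper's exact settings.

```python
import numpy as np
from scipy.signal import hilbert, stft

def rhythm_spectrogram(signal, sr, win_s=3.0, hop_s=0.5, fmax=10.0):
    """Low-frequency spectrogram of the amplitude envelope, in the spirit of
    RFA rhythm spectrograms (window, hop, and band limits are illustrative)."""
    envelope = np.abs(hilbert(signal))                       # amplitude envelope
    f, t, Z = stft(envelope, fs=sr,
                   nperseg=int(win_s * sr),
                   noverlap=int((win_s - hop_s) * sr))
    band = (f > 0.5) & (f <= fmax)                           # rhythm band, DC excluded
    return f[band], t, np.abs(Z[band, :])

# Toy check: noise amplitude-modulated at ~4 Hz (a syllable-like rhythm rate)
sr = 16000
t = np.arange(0, 20, 1 / sr)
x = np.random.randn(t.size) * (1 + 0.8 * np.sin(2 * np.pi * 4 * t))
f, frames, S = rhythm_spectrogram(x, sr)
print(S.shape, f[S.mean(axis=1).argmax()])                   # spectral peak near 4 Hz
```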
This paper presents our submission to the PROCESS Challenge 2025, focusing on spontaneous speech analysis for early dementia detection. For the three-class classification task (Healthy Control, Mild Cognitive Impairment, and Dementia), we propose a cascaded binary classification framework that fine-tunes pre-trained language models and incorporates pause encoding to better capture disfluencies. This design streamlines multi-class classification and addresses class imbalance by restructuring the decision process. For the Mini-Mental State Examination score regression task, we develop an enhanced multimodal fusion system that combines diverse acoustic and linguistic features. Separate regression models are trained on individual feature sets, with ensemble learning applied through score averaging. Experimental results on the test set outperform the baselines provided by the organizers in both tasks, demonstrating the robustness and effectiveness of our approach.
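A minimal sketch of the cascaded binary framing, assuming a first stage that separates Healthy Controls from impaired speakers and a second stage that separates MCI from Dementia; the stage ordering, features, and classifiers are illustrative, not the submission's actual fine-tuned models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class CascadedThreeWay:
    """Cascade of two binary classifiers for HC / MCI / Dementia.
    Stage 1: healthy vs. impaired; Stage 2: MCI vs. Dementia (illustrative)."""

    def __init__(self):
        self.stage1 = LogisticRegression(max_iter=1000)
        self.stage2 = LogisticRegression(max_iter=1000)

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)      # y in {"HC", "MCI", "Dementia"}
        self.stage1.fit(X, (y != "HC").astype(int))
        impaired = y != "HC"
        self.stage2.fit(X[impaired], (y[impaired] == "Dementia").astype(int))
        return self

    def predict(self, X):
        X = np.asarray(X)
        labels = np.array(["HC"] * len(X), dtype=object)
        impaired = self.stage1.predict(X).astype(bool)
        if impaired.any():
            dem = self.stage2.predict(X[impaired]).astype(bool)
            idx = np.where(impaired)[0]
            labels[idx[dem]] = "Dementia"
            labels[idx[~dem]] = "MCI"
        return labels
```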
Alzheimer's Disease (AD) poses a growing global health challenge due to population aging. Using spontaneous speech for the early diagnosis of AD has emerged as a notable area of research. In response to this global trend, we propose a speech-based multilingual AD detection method. We utilize Whisper for transfer learning to build a multilingual pre-trained AD diagnostic model that achieves 81.38% accuracy on a test set comprising multiple languages. To enhance low-resource language performance, we fine-tune the pre-trained model with multilingual data and full transcripts as prompts, achieving a 4-7% accuracy improvement. Additionally, we incorporate the speaker's background information, improving the accuracy for low-resource languages by 11-13%. The results demonstrate the validity of our work for multilingual Alzheimer's detection and illustrate the feasibility of our approach in addressing the global need for Alzheimer's detection.
Alzheimer's Disease (AD) is a neurodegenerative condition characterized by linguistic impairments. While ASR and LLMs show promise in AD detection, ASR often normalizes key AD-related speech patterns and faces cross-lingual challenges due to language dependencies. Moreover, ASR training demands extensive matched data. In this paper, we instead employ a phoneme recognizer as a frontend tokenizer. Provided it has comprehensive phoneme coverage, a multitude of linguistic phenomena can be represented via phoneme sequences, including hesitations, repetitions, pauses, mispronunciations, and even distinctions between language identities that are crucial for AD detection. Furthermore, a BERT model is employed to extract high-dimensional features from the Phonetic PosteriorGrams (PPGs), which are ultimately used to diagnose Alzheimer's disease. Our approach offers cross-lingual applicability, achieves competitive accuracy, and maintains computational efficiency.
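One simple way to turn frame-level PPGs into a token sequence for a downstream text encoder is frame-wise arg-max followed by CTC-style repeat collapsing, as sketched below; this is an assumption for illustration, since the exact mapping from PPGs to BERT inputs is not spelled out here.

```python
import numpy as np

def ppg_to_phoneme_string(ppg, phoneme_inventory):
    """Collapse a Phonetic PosteriorGram (frames x phonemes) into a phoneme
    sequence by frame-wise arg-max and CTC-style repeat merging. The resulting
    string can then be fed to a text encoder such as BERT; how the paper maps
    phonemes to BERT tokens is not specified here."""
    ids = ppg.argmax(axis=1)
    collapsed = [phoneme_inventory[i] for j, i in enumerate(ids)
                 if j == 0 or ids[j] != ids[j - 1]]
    return " ".join(collapsed)

# Toy usage: 6 frames over a 4-phoneme inventory
inventory = ["sil", "a", "t", "n"]
ppg = np.array([[0.9, .05, .03, .02],
                [0.1, .80, .05, .05],
                [0.1, .75, .10, .05],
                [0.1, .10, .70, .10],
                [0.1, .10, .05, .75],
                [0.9, .05, .03, .02]])
print(ppg_to_phoneme_string(ppg, inventory))   # "sil a t n sil"
```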
Multi-Task Learning (MTL) is widely used in automatic stuttering detection to identify stuttering symptoms; however, task conflicts can hinder performance. This paper addresses the task conflict issue in MTL-based stuttering detection and proposes a rule-based MTL strategy and a Multi-Mixture-of-Experts (MMoE) MTL framework to alleviate these conflicts. We analyze the inherent conflicts in stuttering detection tasks and develop a rule-based MTL strategy to mitigate them. Additionally, we introduce an MoE-based adaptive multi-task strategy to optimize task allocation. Experimental results show that our approach outperforms current state-of-the-art methods. In the 2024 SLT Stuttering Speech Challenge, the rule-based MTL strategy achieved a 19.9% increase in average F1 score over the baseline, securing first place. The MMoE-MTL strategy further enhanced task collaboration, improving the average F1 score by 7.55%, demonstrating its effectiveness.
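The mixture-of-experts component can be pictured as a standard multi-gate MoE layer: shared experts with one softmax gate and one head per stuttering-detection task. The sketch below is a generic formulation with illustrative dimensions, not the paper's MMoE-MTL architecture.

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    """Generic multi-gate mixture-of-experts layer: shared experts, one softmax
    gate per task, and per-task heads (dimensions here are illustrative)."""

    def __init__(self, in_dim, expert_dim, n_experts, n_tasks):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, expert_dim), nn.ReLU())
             for _ in range(n_experts)])
        self.gates = nn.ModuleList(
            [nn.Linear(in_dim, n_experts) for _ in range(n_tasks)])
        self.heads = nn.ModuleList(
            [nn.Linear(expert_dim, 2) for _ in range(n_tasks)])  # binary per task

    def forward(self, x):
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D)
        logits = []
        for gate, head in zip(self.gates, self.heads):
            w = torch.softmax(gate(x), dim=-1).unsqueeze(-1)           # (B, E, 1)
            mixed = (w * expert_out).sum(dim=1)                        # (B, D)
            logits.append(head(mixed))
        return logits                                                  # one tensor per task

# Toy usage: 5 stuttering-related binary tasks over a 256-dim utterance embedding
model = MMoE(in_dim=256, expert_dim=128, n_experts=4, n_tasks=5)
print([t.shape for t in model(torch.randn(8, 256))])
```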
We explore language-agnostic deep text embeddings for severity classification of dysarthria in Amyotrophic Lateral Sclerosis (ALS). Speech recordings are transcribed both manually and by ASR, and embeddings of the transcripts are considered. Though speech recognition accuracy has been studied for grading dysarthria severity, no effort has yet been made to utilize text embeddings of the transcripts. We perform severity classification at different granularities (2, 3, and 5 classes) using data obtained from 47 ALS subjects. Experiments with dense neural network based classifiers suggest that, though text features perform nearly as well as baseline speech features, such as statistics of mel-frequency cepstral coefficients (MFCCs), for 2-class classification, speech features outperform them for higher numbers of classes. Concatenation of text embeddings and MFCC statistics attains the best performance, with mean F1 scores of 88%, 68%, and 53% in 2-, 3-, and 5-class classification, respectively.
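A minimal sketch of the feature concatenation that performed best, assuming librosa for MFCC statistics and a multilingual sentence-transformer for the language-agnostic text embeddings; both the tools and the embedding model are our assumptions for illustration, not the study's exact setup.

```python
import numpy as np
import librosa
from sentence_transformers import SentenceTransformer

def mfcc_stats(wav_path, n_mfcc=13):
    """Mean and standard deviation of MFCCs, a common acoustic baseline."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # (n_mfcc, n_frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Language-agnostic embedding of the human or ASR transcript
# (model choice is an assumption, not the paper's)
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def fused_features(wav_path, transcript):
    """Concatenation of deep text embeddings and MFCC statistics, the
    configuration reported as best-performing above."""
    return np.concatenate([encoder.encode(transcript), mfcc_stats(wav_path)])
```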
Detecting and segmenting dysfluencies is crucial for effective speech therapy and real-time feedback. However, most methods only classify dysfluencies at the utterance level. We introduce StutterCut, a semi-supervised framework that formulates dysfluency segmentation as a graph partitioning problem, where speech embeddings from overlapping windows are represented as graph nodes. We refine the connections between nodes using a pseudo-oracle classifier trained on weak (utterance-level) labels, with its influence controlled by an uncertainty measure from Monte Carlo dropout. Additionally, we extend the weakly labelled FluencyBank dataset by incorporating frame-level dysfluency boundaries for four dysfluency types. This provides a more realistic benchmark compared to synthetic datasets. Experiments on real and synthetic datasets show that StutterCut outperforms existing methods, achieving higher F1 scores and more precise stuttering onset detection.
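The Monte Carlo dropout uncertainty that gates the pseudo-oracle's influence can be estimated along the following lines; the number of stochastic passes and the use of predictive entropy are illustrative choices, not StutterCut's exact recipe.

```python
import torch

@torch.no_grad()
def mc_dropout_uncertainty(model, x, n_passes=20):
    """Run the pseudo-oracle classifier several times with dropout active and
    return the mean prediction plus predictive entropy as an uncertainty score."""
    model.train()                      # keep dropout layers stochastic
    probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_passes)])
    mean_p = probs.mean(dim=0)                                   # (B, C)
    entropy = -(mean_p * torch.log(mean_p + 1e-8)).sum(dim=-1)   # (B,)
    model.eval()
    return mean_p, entropy

# Toy usage with a stand-in classifier containing dropout
clf = torch.nn.Sequential(torch.nn.Linear(64, 32), torch.nn.ReLU(),
                          torch.nn.Dropout(0.3), torch.nn.Linear(32, 2))
mean_p, unc = mc_dropout_uncertainty(clf, torch.randn(4, 64))
print(mean_p.shape, unc.shape)
```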
Post-stroke speech disorders impair communication and rehabilitation outcomes, often requiring prolonged, intensive therapy sessions. The diversity of symptoms, coupled with the high cost and logistical burden of traditional speech therapy, underscores the need for accurate, automatic assessment to support tailored interventions. Leveraging a purpose-built database of stroke patients, this study introduces a feature-driven framework integrating traditional acoustic features with physiologically informed glottal parameters for classifying impaired speech after stroke. Evaluating unimodal, combined, and SHAP-derived feature configurations, our approach achieved a 97% F1-score in distinguishing pathological from healthy speech. These results highlight the potential of combining clinically meaningful glottal and acoustic information to support early speech deterioration detection, enhancing accessibility and personalised rehabilitation strategies for improved patient outcomes.
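One common way to obtain a SHAP-derived feature configuration is to rank features by mean absolute SHAP value from a surrogate model and keep the top-ranked subset, as sketched below; the model, library calls, and cut-off are assumptions for illustration, not the study's implementation.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

def shap_top_features(X, y, feature_names, k=20):
    """Rank features by mean |SHAP| value from a surrogate tree model and keep
    the top k (a generic stand-in for a SHAP-derived feature configuration)."""
    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    sv = np.abs(np.asarray(shap.TreeExplainer(model).shap_values(X)))
    if sv.ndim == 3:                       # some shap versions add a class axis
        sv = sv.mean(axis=0) if sv.shape[-1] == X.shape[1] else sv.mean(axis=-1)
    importance = sv.mean(axis=0)           # mean |SHAP| per feature
    top = np.argsort(importance)[::-1][:k]
    return [feature_names[i] for i in top]
```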
The development of high-performing, robust, and reliable speech technologies depends on large, high-quality datasets. However, African languages -- including our focus, Igbo, Hausa, and Yoruba -- remain under-represented due to insufficient data. Popular voice-enabled technologies do not support any of the 2000+ African languages, limiting accessibility for circa one billion people. While previous dataset efforts exist for the target languages, they lack the scale and diversity needed for robust speech models. To bridge this gap, we introduce the NaijaVoices dataset, a 1,800-hour speech-text dataset with 5,000+ speakers. We outline our unique data collection approach, analyze its acoustic diversity, and demonstrate its impact through finetuning experiments on automatic speech recognition, achieving average WER improvements of 75.86% (Whisper), 52.06% (MMS), and 42.33% (XLSR). These results highlight NaijaVoices' potential to advance multilingual speech processing for African languages.
Fairness in speech processing systems is a critical challenge, especially regarding performance disparities based on speakers' backgrounds. To help combat this problem, we are introducing FaiST (Fairness in Speech Technology), a novel speech dataset from American English speakers of various racial, ethnic, and national origin groups. The goal is to evaluate and mitigate possible biases across speech technologies using conversational speech from podcasts and interviews online. In FaiST's current version, speakers self-identified as Asian American and African American, and future iterations will include other groups. White American speakers' speech was extracted from VoxCeleb, and their demographic labels were obtained online. In addition to identifiers of race, ethnicity, and national origins, FaiST is also marked for the exact instances in the conversation where self-identifications occurred. We experimented with FaiST and found racial bias in eighteen Automatic Speech Recognition systems.
In recent years, there has been a growing focus on fairness and inclusivity within speech technology, particularly in areas such as automatic speech recognition and speech sentiment analysis. When audio is transcoded prior to processing, as is the case in streaming or real-time applications, any inherent bias in the coding mechanism may result in disparities. This not only affects user experience but can also have broader societal implications by perpetuating stereotypes and exclusion. Thus, it is important that audio coding mechanisms are unbiased. In this work, we contribute to the scarce research on language and gender biases of audio codecs. By analyzing the speech quality of over 2 million multilingual audio files after transcoding through a representative subset of codecs (PSTN, VoIP and neural), our results indicate that PSTN codecs are strongly biased in terms of gender and that neural codecs introduce language biases.
Speech enhancement models have traditionally relied on VoiceBank-DEMAND for training and evaluation. However, this dataset presents significant limitations due to its limited diversity and simulated noise conditions. As an alternative, we propose and demonstrate the usefulness of evaluating the generalization capabilities of recent speech enhancement models using CommonPhone, a multilingual and crowdsourced dataset. Since CommonPhone is derived from CommonVoice, it enables analysis of enhancement performance based on demographic variables such as age and gender. Our experiments reveal significant performance variations across these variables. We also introduce a new benchmark dataset designed to challenge enhancement models with difficult and diverse speech samples, facilitating future research in universal speech enhancement.
Speech pauses serve as a valuable and non-invasive biomarker for the early detection of dementia. Our study examines abnormal pauses, specifically their durations, to improve detection performance. Inspired by the proven performance of Transformer-based models in dementia detection, we opted to integrate the abnormal pauses into these models. Specifically, we enriched the inputs to the Transformer-based models by fusing between-segment pause context into the automated transcriptions. We performed the experiments on our Cantonese elderly corpus, CU-Marvel. To improve detection performance, we optimized the pause durations when infusing the pause context into the transcriptions. Our findings suggest that between-segment pauses can serve as promising biomarkers, and they underline the importance of optimizing pause patterns across different languages and datasets.
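The pause-infusion step can be pictured as inserting pause tokens between transcribed segments according to the between-segment gap duration, as in the sketch below; the token names and duration thresholds are illustrative, since the study optimizes the pause durations rather than fixing them.

```python
def infuse_pause_context(segments, short_s=0.5, long_s=2.0):
    """Insert pause tokens between transcribed segments according to the
    between-segment pause duration; thresholds here are illustrative."""
    pieces = []
    for i, (text, start, end) in enumerate(segments):
        if i > 0:
            gap = start - segments[i - 1][2]
            if gap >= long_s:
                pieces.append("<long_pause>")
            elif gap >= short_s:
                pieces.append("<short_pause>")
        pieces.append(text)
    return " ".join(pieces)

# Toy usage: (text, start_s, end_s) per automatically transcribed segment
segments = [("she is washing dishes", 0.0, 2.1),
            ("the sink is", 3.0, 4.0),
            ("overflowing", 6.5, 7.4)]
print(infuse_pause_context(segments))
# "she is washing dishes <short_pause> the sink is <long_pause> overflowing"
```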
Whisper fails to correctly transcribe dementia speech because persons with dementia (PwDs) often exhibit irregular speech patterns and disfluencies such as pauses, repetitions, and fragmented sentences. It was trained on standard speech and may have had little or no exposure to dementia-affected speech. However, accurate transcription of dementia speech is vital for cost-effective diagnosis and the development of assistive technology. In this work, we fine-tune Whisper with the open-source dementia speech dataset (DementiaBank) and our in-house dataset to improve its word error rate (WER). The fine-tuning also includes filler words to ascertain the filler inclusion rate (FIR) and F1 score. The fine-tuned models significantly outperformed the off-the-shelf models. The medium-sized model achieved a WER of 0.24, outperforming previous work. The models also generalised notably well to unseen data and speech patterns.
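For evaluation, WER can be computed with a standard toolkit such as jiwer, and one plausible reading of the filler inclusion rate is the proportion of reference fillers recovered in the hypothesis; both the tool and the FIR definition below are assumptions, not necessarily the paper's.

```python
import jiwer

FILLERS = {"uh", "um", "er", "erm", "eh"}

def filler_inclusion_rate(reference, hypothesis):
    """Fraction of filler tokens in the reference that also appear in the
    hypothesis (one plausible definition of FIR; the paper's may differ)."""
    ref_fillers = [w for w in reference.lower().split() if w in FILLERS]
    if not ref_fillers:
        return None
    hyp_counts = {}
    for w in hypothesis.lower().split():
        hyp_counts[w] = hyp_counts.get(w, 0) + 1
    matched = 0
    for w in ref_fillers:
        if hyp_counts.get(w, 0) > 0:
            hyp_counts[w] -= 1
            matched += 1
    return matched / len(ref_fillers)

ref = "well um she was uh washing the dishes"
hyp = "well she was uh washing the dishes"
print(jiwer.wer(ref, hyp), filler_inclusion_rate(ref, hyp))
```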
This work describes a comprehensive approach for the automatic assessment of cognitive decline from spontaneous speech in the context of the PROCESS Challenge 2025. Based on our previous experience with speech- and text-derived biomarkers for disease detection, we evaluate the use of knowledge-based acoustic and text-based feature sets, as well as LLM-based macro-descriptors, and multiple neural representations (e.g., Longformer, ECAPA-TDNN, and Trillsson embeddings). The combination of these feature sets with different classifiers resulted in a large pool of systems, from which those providing the best balance between training, development, and individual-class performance were selected for model ensembling. Our final best-performing systems correspond to combinations of models that are complementary to each other, relying on acoustic and textual information from the three clinical tasks provided in the challenge dataset.
Alzheimer’s dementia (AD) is a neurodegenerative disorder with cognitive decline that commonly impacts language ability. This work extends the paired perplexity approach to detecting AD by using a recent large language model (LLM), the instruction-following version of Mistral-7B. We improve accuracy by an average of 3.33% over the best current paired perplexity method and by 6.35% over the top-ranked method from the ADReSS 2020 challenge benchmark. Our further analysis demonstrates that the proposed approach can effectively detect AD with a clear and interpretable decision boundary in contrast to other methods that suffer from opaque decision-making processes. Finally, by prompting the fine-tuned LLMs and comparing the model-generated responses to human responses, we illustrate that the LLMs have learned the special language patterns of AD speakers, which opens up possibilities for novel methods of model interpretation and data augmentation.
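The paired perplexity decision rule can be sketched as follows: score a transcript under an AD-tuned and a control-tuned causal LM and classify by the perplexity difference against a threshold learned on held-out data; the model-loading details and margin below are illustrative, not the paper's fine-tuned Mistral-7B setup.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text):
    """Perplexity of a transcript under a causal language model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def paired_perplexity_decision(text, ad_model, control_model, tokenizer, margin=0.0):
    """Label a transcript AD if it is more probable (lower perplexity) under the
    AD-tuned LM than under the control-tuned LM; the margin plays the role of
    the decision boundary and would be chosen on held-out data."""
    diff = (perplexity(control_model, tokenizer, text)
            - perplexity(ad_model, tokenizer, text))
    return ("AD" if diff > margin else "Control"), diff

# The two fine-tuned LMs would be loaded from their respective checkpoints, e.g.
# AutoModelForCausalLM.from_pretrained("<ad-finetuned-checkpoint>") and
# AutoModelForCausalLM.from_pretrained("<control-finetuned-checkpoint>").
```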