Responsible AI may not be a universally agreed-upon concept, and the list of so-called pillars may not be uniquely defined either. Nonetheless, their message is clear and urgent. In this talk, I'll address some of the pillars of responsible speech processing, focusing on privacy, explainability (namely for health applications), fairness/inclusion and sustainability. Rather than attempting a comprehensive survey of all the efforts in these directions, I will present my own perspective on how these pillars should inform the next generation of speech research.
We present a novel benchmark dataset and prediction tasks for investigating approaches to assess cognitive function through analysis of connected speech. The dataset consists of speech samples and clinical information for speakers of Mandarin Chinese and English with different levels of cognitive impairment, as well as individuals with normal cognition. These data have been carefully matched by age and sex using propensity score analysis to ensure balance and representativeness in model training. The prediction tasks encompass mild cognitive impairment diagnosis and cognitive test score prediction. This framework was designed to encourage the development of approaches to speech-based cognitive assessment that generalise across languages. We illustrate it by presenting baseline prediction models that employ language-agnostic and comparable features for diagnosis and cognitive test score prediction. Unweighted average recall was 59.2% for diagnosis, and root mean squared error was 2.89 for score prediction.
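A minimal sketch of how the two reported metrics are typically computed (UAR as macro-averaged recall, RMSE from squared prediction errors); the label and score arrays below are illustrative placeholders, not benchmark data.

```python
# Illustrative computation of the benchmark's two metrics: unweighted average
# recall (UAR) for MCI diagnosis and root mean squared error (RMSE) for
# cognitive test score prediction. All values are placeholders.
import numpy as np
from sklearn.metrics import recall_score

# Diagnosis task: 1 = MCI, 0 = normal cognition (hypothetical predictions)
y_true_diag = np.array([1, 0, 1, 1, 0, 0])
y_pred_diag = np.array([1, 0, 0, 1, 0, 1])
uar = recall_score(y_true_diag, y_pred_diag, average="macro")  # mean per-class recall

# Score prediction task: cognitive test scores (hypothetical values)
y_true_score = np.array([28.0, 25.0, 30.0, 22.0])
y_pred_score = np.array([27.0, 26.5, 29.0, 24.0])
rmse = np.sqrt(np.mean((y_true_score - y_pred_score) ** 2))

print(f"UAR: {uar:.3f}, RMSE: {rmse:.3f}")
```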
Cognitive decline is a natural process that occurs as individuals age. Early diagnosis of anomalous decline is crucial for initiating professional treatment that can enhance the quality of life of those affected. To address this issue, we propose a multimodal model capable of predicting Mild Cognitive Impairment and cognitive scores. The evaluation is conducted on the TAUKADIAL dataset, which comprises audio recordings of clinical interviews. The proposed model demonstrates the ability to transcribe and differentiate between the languages used in the interviews. Subsequently, the model extracts audio and text features, combining them in a multimodal architecture to achieve robust and generalized results. Our approach involves an in-depth investigation of the various features obtained from the proposed modalities.
Shared tasks or challenges provide valuable opportunities for the machine learning community, as they offer a chance to compare the performance of machine learning approaches without peeking (due to the hidden test set). We present the approach of our team for the Interspeech'24 TAUKADIAL Challenge, where the task is to distinguish patients of Mild Cognitive Impairment (MCI) from healthy controls based on their speech. Our workflow focuses entirely on the acoustics, mixing standard feature sets (ComParE functionals and wav2vec2 embeddings) and custom attributes focusing on the amount of silent and filled pause segments. By training dedicated SVM classifiers on the three speech tasks and combining the predictions over the different speech tasks and feature sets, we obtained F1 values of up to 0.76 for the MCI identification task using cross-validation, while our RMSE scores for the MMSE estimation task were as low as 2.769 (cross-validation) and 2.608 (test).
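A hypothetical sketch of the late-fusion idea described above: one dedicated SVM per speech task and feature set, with decision scores averaged across models. The feature dimensions, labels, and fitting procedure are placeholder assumptions, not the team's actual challenge pipeline.

```python
# Hypothetical late fusion of per-task SVM classifiers; data are random placeholders.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_speakers = 40
labels = rng.integers(0, 2, n_speakers)            # 0 = control, 1 = MCI
feature_sets = {                                   # e.g. ComParE functionals, wav2vec2 means, pause stats
    "task1_compare": rng.normal(size=(n_speakers, 88)),
    "task2_wav2vec2": rng.normal(size=(n_speakers, 768)),
    "task3_pauses": rng.normal(size=(n_speakers, 6)),
}

decision_values = []
for name, X in feature_sets.items():
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    clf.fit(X, labels)                             # in practice: cross-validation, not fitting on all data
    decision_values.append(clf.decision_function(X))

fused = np.mean(decision_values, axis=0)           # average the SVM decision scores
predictions = (fused > 0).astype(int)
```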
Effective diagnosis of Mild Cognitive Impairment (MCI), a preclinical stage of cognitive decline, is significant for delaying disease progression. While most current spontaneous speech-based diagnostic methods focus on English speech, the Interspeech 2024 TAUKADIAL Challenge proposed an innovative research direction to develop a language-agnostic approach to diagnose MCI. This paper proposes an MCI diagnosis method by analyzing and combining linguistic and acoustic features using the bilingual Chinese-English speech dataset provided by the challenge. We employed a pre-trained multilingual model and expressivity encoder to extract language-agnostic speech features. To overcome the challenges of data scarcity and language diversity, we implemented data augmentation and alignment to enhance the model's generalization. Our approach achieved 77.5% accuracy, demonstrating its effectiveness and potential on cross-lingual data.
Mild cognitive impairment (MCI) and dementia significantly impact millions worldwide and rank as a major cause of mortality. Since traditional diagnostic methods are often costly and result in delayed diagnoses, many efforts have been made to propose automatic detection approaches. However, most methods focus on monolingual cases, limiting the scalability of their models to individuals speaking different languages. To understand the common characteristics of people with MCI speaking different languages, we propose a multilingual MCI detection model using multimodal approaches that analyze both acoustic and linguistic features. It outperforms existing machine learning models by identifying universal MCI indicators across languages. Particularly, we find that speech duration and pauses are crucial in detecting MCI in multilingual settings. Our findings can potentially facilitate early intervention in cognitive decline across diverse linguistic backgrounds.
This study examines the suitability of language-agnostic features for automatically detecting Mild Cognitive Impairment (MCI) and predicting Mini-Mental State Examination (MMSE) scores in a multilingual framework. We explored two methods for feature extraction: traditional feature engineering and pre-trained feature representation. We developed our models using the Interspeech 2024 Taukadial challenge data set, containing audio recordings from subjects with MCI and controls in Chinese and English. Our top ensemble model achieved 75% accuracy in MCI detection and an RMSE of 2.44 in MMSE prediction on the test set. Our results reveal the complementary nature of acoustic and linguistic representations and the existence of universal features that can be used cross-lingually. However, a statistical analysis of interpretable features did not show any shared speech patterns between the two languages, which can be attributed to differences in disease severity between the two cohorts of participants.
Mild Cognitive Impairment (MCI) is considered a prodromal stage of dementia, including Alzheimer's disease. It is characterized by behavioral changes and decreased cognitive function, while individuals can still maintain their independence. Early detection of MCI is critical, as it allows for timely intervention, enrichment of clinical trial cohorts, and the development of therapeutic approaches. Recently, language markers have been shown to be a promising approach to identifying MCI in a non-intrusive, affordable, and accessible fashion. In the InterSpeech 2024 TAUKADIAL Challenge, we study language markers from spontaneous speech in English and Chinese and use the bilingual language markers to identify MCI cases and predict the Mini-Mental State Examination (MMSE) scores. Our proposed framework combines the strengths of 1) feature extraction of a comprehensive set of bilingual acoustic features, and semantic and syntactic features from language models; 2) careful treatment of model complexity for small sample size; 3) consideration of imbalanced demographic structure, potential outlier removal, and a multi-task treatment that uses the prediction of clinical classification as prior for MMSE prediction. The proposed approach delivers an average Balanced Accuracy of 78.2% in MCI detection and an average RMSE of 2.705 in predicting MMSE. Our empirical evaluation shows that translingual language markers can improve the detection of MCI from spontaneous speech. Our code is available at https://github.com/illidanlab/translingual-language-markers.
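A hypothetical sketch of the multi-task treatment mentioned in point 3, in which the clinical classification output serves as a prior feature for MMSE regression; the specific models and data below are illustrative assumptions, not the released code.

```python
# Hypothetical two-stage setup: the classifier's predicted MCI probability is
# appended as an extra input feature for MMSE regression. Data are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))                      # bilingual acoustic + linguistic features
y_mci = rng.integers(0, 2, 60)                     # MCI / control labels
y_mmse = rng.uniform(20, 30, 60)                   # MMSE scores

clf = LogisticRegression(max_iter=1000).fit(X, y_mci)
mci_prob = clf.predict_proba(X)[:, [1]]            # classification output used as a prior

reg = Ridge().fit(np.hstack([X, mci_prob]), y_mmse)
mmse_pred = reg.predict(np.hstack([X, mci_prob]))
```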
Cognitive decline, a hallmark of several neurological conditions, including dementia and Alzheimer's disease, often manifests in noticeable changes in speech patterns and language use. Speech analysis in this context can serve as a valuable tool for early detection and monitoring of cognitive impairment. In this paper, we present the results of our attempts at the TAUKADIAL Challenge for automatically detecting people with mild cognitive impairment and predicting a cognitive score on English and Chinese speakers. In the classification task, we achieved a UAR of 83% using two language-dependent classifiers trained with timing and acoustic features. In the regression task, we obtained an RMSE of 1.87 using English speakers to train the base model with timing, acoustic, and language-dependent features.
To deepen and enrich our daily communications, researchers have made significant efforts over several decades to develop technologies that can recognize and understand natural human conversations. Despite significant progress in both speech/language processing and speech enhancement technology, conversational speech processing remains challenging. Recordings of conversations with distant microphones contain ambient noise, reverberation, and speaker overlap that changes as the conversation progresses. Consequently, recognizing conversational speech is much more challenging than single-talker speech recognition, and frontend technologies such as speech enhancement and speaker diarization are essential to achieving highly accurate conversational speech processing. For more than two decades, the presenter‘s research group has explored frontend techniques (source separation, dereverberation, noise reduction, and diarization) for handling realistic natural conversations with distant microphones. In this talk, I would like to talk about the evolution and frontier of frontend technologies for conversational signal processing. Specifically, we will trace the evolution of multichannel signal processing and neural network techniques, including beamforming and target speaker tracking and extraction, which have always played an important role in successive cutting-edge frontends, along with the latest achievements.
This paper describes a simple yet robust approach to performing reference-free estimation of the quality of automatically-generated clinical notes derived from doctor-patient conversations. In the absence of human-written reference notes, this approach works by generating a diverse collection of "pseudo-reference notes" and comparing the generated note against those pseudo-references. This method has been applied to estimate the quality of clinical note sections generated by three different note generation models, using a collection of evaluation metrics that are based on natural language inference and clinical concept extraction. Our experiments show the proposed approach is robust to the choice of note generation models, and consistently produces higher correlations with reference-based counterparts when compared against a strong baseline method.
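A minimal sketch of the pseudo-reference idea, assuming a simple token-overlap F1 as a stand-in for the NLI- and concept-based metrics used in the paper: the candidate note is scored against several generated pseudo-reference notes and the scores are averaged.

```python
# Reference-free scoring via pseudo-references; token-overlap F1 is an
# illustrative stand-in metric, and the note snippets are made up.
def token_f1(candidate: str, reference: str) -> float:
    cand, ref = candidate.lower().split(), reference.lower().split()
    common = sum(min(cand.count(t), ref.count(t)) for t in set(cand))
    if common == 0:
        return 0.0
    precision, recall = common / len(cand), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def pseudo_reference_score(candidate_note: str, pseudo_references: list[str]) -> float:
    # Average the candidate's score over the diverse pseudo-reference notes.
    return sum(token_f1(candidate_note, r) for r in pseudo_references) / len(pseudo_references)

candidate = "Patient reports intermittent chest pain for two weeks."
pseudo_refs = [
    "Patient describes chest pain on and off over the past two weeks.",
    "Two weeks of intermittent chest discomfort reported by the patient.",
]
print(pseudo_reference_score(candidate, pseudo_refs))
```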
Autism Spectrum Disorder (ASD) is a lifelong condition that significantly influences an individual's communication abilities and social interactions. Early diagnosis and intervention are critical due to the profound impact of ASD's characteristic behaviors on foundational developmental stages. However, limitations of standardized diagnostic tools necessitate the development of objective and precise diagnostic methodologies. This paper proposes an end-to-end framework for automatically predicting the social communication severity of children with ASD from raw speech data. This framework incorporates an automatic speech recognition model, fine-tuned with speech data from children with ASD, followed by the application of fine-tuned pre-trained language models to generate a final prediction score. Achieving a Pearson Correlation Coefficient of 0.6566 with human-rated scores, the proposed method showcases its potential as an accessible and objective tool for the assessment of ASD.
Vocal biomarkers are measurable characteristics of a person's voice that provide valuable insights into various aspects of their physiological and psychological state, or health status. The use of standardized voice tasks, such as reading, counting, or sustained vowel phonation, is common in vocal biomarker research, but semi-spontaneous tasks, where the person is instructed to talk about a particular topic, or spontaneous speech are also increasingly used. However, limited efforts have been made to combine multiple voice modalities. In this paper, we propose a simple yet efficient approach to fusing multiple standardized voice tasks based on vector cross-attention, showing improved predictive capacity for derived vocal biomarkers in comparison to single modalities. The multimodal approach is tested on the assessment of respiratory quality of life from reading and sustained vowel phonation recordings, outperforming single modalities by up to 4.2% in terms of accuracy (a relative increase of 7%).
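A hypothetical sketch of fusing embeddings from two standardized voice tasks with cross-attention, in the spirit of the vector cross-attention fusion described above; the embedding dimension, the single-vector "sequences", and the classification head are illustrative assumptions.

```python
# Hypothetical cross-attention fusion of reading-task and sustained-vowel embeddings.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, reading_emb: torch.Tensor, vowel_emb: torch.Tensor) -> torch.Tensor:
        # Treat each task embedding as a length-1 sequence; the reading-task
        # vector queries the sustained-vowel vector.
        q, kv = reading_emb.unsqueeze(1), vowel_emb.unsqueeze(1)
        attended, _ = self.attn(q, kv, kv)
        fused = torch.cat([reading_emb, attended.squeeze(1)], dim=-1)
        return self.head(fused)

model = CrossAttentionFusion()
logits = model(torch.randn(8, 256), torch.randn(8, 256))  # batch of 8 recordings
```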
Numerous speech-based health assessment studies report high accuracy rates for machine learning models which detect conditions such as depression and Alzheimer’s disease. There are growing concerns that these reported performances are often overestimated, especially in small-scale cross-sectional studies. Possible causes for this overestimation include overfitting, publication biases and a lack of standard procedures to report findings and testing methodology. Another key source of misrepresentation is the reliance on aggregate-level performance metrics. Speech is a highly variable signal that can be affected by factors including age, sex, and accent, which can easily bias models. We highlight this impact by presenting a simple benchmark model for assessing the extent to which aggregate metrics exaggerate the efficacy of a machine learning model in the presence of confounders. We then demonstrate the usefulness of this model on exemplar speech-health assessment datasets.
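One way to make this concrete is a confounder-only baseline: if a model trained on age and sex alone approaches the speech model's aggregate accuracy, the reported performance may largely reflect demographic confounding. The sketch below uses synthetic data and is not the paper's benchmark model.

```python
# Illustrative confounder-only baseline on synthetic, deliberately confounded data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 200
age = rng.uniform(50, 90, n)
sex = rng.integers(0, 2, n)
# Synthetic labels correlated with age, mimicking a confounded health dataset.
labels = (age + rng.normal(0, 10, n) > 70).astype(int)

confounders = np.column_stack([age, sex])
baseline_acc = cross_val_score(LogisticRegression(max_iter=1000), confounders, labels, cv=5).mean()
print(f"Confounder-only baseline accuracy: {baseline_acc:.2f}")
```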
The world of voice biomarkers is rapidly evolving thanks to the use of artificial intelligence (AI), allowing large-scale analysis of voice, speech, and respiratory sound data. The Bridge2AI-Voice project aims to build a large-scale, ethically sourced, and diverse database of human voices linked to health information to help fuel Voice AI research, dubbed Audiomics. The current paper describes the development of data acquisition protocols across four adult disease cohorts (voice disorders, respiratory disorders, neurodegenerative diseases, and mood and anxiety disorders) using a Team Science approach for broader adoption by the research community and feedback. Demographic surveys, confounder assessments, acoustic tasks, validated patient-reported outcome (PRO) questionnaires, and clinician-validated diagnostic questions were grouped into a common PART A shared across all cohorts and an individual PART B with cohort-specific tasks.
Depression, a prevalent mental health disorder impacting millions globally, demands reliable assessment systems. Unlike previous studies that focus solely on either detecting depression or predicting its severity, our work identifies individual symptoms of depression while also predicting its severity using speech input. We leverage self-supervised learning (SSL)-based speech models to better utilize the small-sized datasets that are frequently encountered in this task. Our study demonstrates notable performance improvements by utilizing SSL embeddings compared to conventional speech features. We compare various types of SSL pretrained models to elucidate the type of speech information (semantic, speaker, or prosodic) that contributes the most in identifying different symptoms. Additionally, we evaluate the impact of combining multiple SSL embeddings on performance. Furthermore, we show the significance of multi-task learning for identifying depressive symptoms effectively.
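A hypothetical sketch of a multi-task head over SSL speech embeddings, with one output per depressive symptom plus a severity regression branch; the embedding dimension, symptom count, and layer sizes are illustrative assumptions.

```python
# Hypothetical multi-task head over SSL embeddings: symptom logits + severity score.
import torch
import torch.nn as nn

class MultiTaskDepressionHead(nn.Module):
    def __init__(self, ssl_dim: int = 768, n_symptoms: int = 8):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(ssl_dim, 256), nn.ReLU())
        self.symptom_heads = nn.Linear(256, n_symptoms)   # one logit per symptom
        self.severity_head = nn.Linear(256, 1)            # overall severity score

    def forward(self, ssl_embedding: torch.Tensor):
        h = self.shared(ssl_embedding)
        return self.symptom_heads(h), self.severity_head(h).squeeze(-1)

model = MultiTaskDepressionHead()
symptom_logits, severity = model(torch.randn(4, 768))      # batch of 4 utterance embeddings
```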
The most common types of voice disorders are associated with hyperfunctional voice use in daily life. Although current clinical practice uses measures from brief laboratory recordings to assess vocal function, it is unclear how these relate to an individual’s habitual voice use. The purpose of this study was to quantify the correlation and offset between voice features computed from laboratory and ambulatory recordings in speakers with and without vocal hyperfunction. Features derived from a neck-surface accelerometer included estimates of sound pressure level, fundamental frequency, cepstral peak prominence, and spectral tilt. Whereas some measures from laboratory recordings correlated significantly with those captured during daily life, only approximately 6–52% of the actual variance was accounted for. Thus, brief voice assessments are quite limited in the extent to which they can accurately characterize the daily voice use of speakers with and without vocal hyperfunction.
Evaluating pain in speech represents a critical challenge in high-stakes clinical scenarios, from analgesia delivery to emergency triage. Clinicians have predominantly relied on direct verbal communication of pain which is difficult for patients with communication barriers, such as those affected by stroke, autism, and learning difficulties. Many previous efforts have focused on multimodal data which does not suit all clinical applications. Our work is the first to collect a new English speech dataset wherein we have induced acute pain in adults using a cold pressor task protocol and recorded subjects reading sentences out loud. We report pain discrimination performance as F1 scores from binary (pain vs. no pain) and three-class (mild, moderate, severe) prediction tasks, and support our results with explainable feature analysis. Our work is a step towards providing medical decision support for pain evaluation from speech to improve care across diverse and remote healthcare settings.
To develop intelligent speech assistants and integrate them seamlessly with intra-operative decision-support frameworks, accurate and efficient surgical phase recognition is a prerequisite. In this study, we propose a multimodal framework based on Gated Multimodal Units (GMU) and Multi-Stage Temporal Convolutional Networks (MS-TCN) to recognize surgical phases of port-catheter placement operations. Our method merges speech and image models and uses them separately in different surgical phases. Based on the evaluation of 28 operations, we report a frame-wise accuracy of 92.65 ± 3.52% and an F1-score of 92.30 ± 3.82%. Our results show approximately 10% improvement in both metrics over previous work and validate the effectiveness of integrating multimodal data for the surgical phase recognition task. We further investigate the contribution of individual data channels by comparing mono-modal models with multimodal models.
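A minimal sketch of a Gated Multimodal Unit in the sense of Arevalo et al., gating between a speech and an image representation; the dimensions are illustrative and the MS-TCN temporal model is omitted.

```python
# Minimal Gated Multimodal Unit (GMU) gating between speech and image features.
import torch
import torch.nn as nn

class GatedMultimodalUnit(nn.Module):
    def __init__(self, speech_dim: int, image_dim: int, hidden_dim: int):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.gate = nn.Linear(speech_dim + image_dim, hidden_dim)

    def forward(self, speech: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        h_s = torch.tanh(self.speech_proj(speech))
        h_i = torch.tanh(self.image_proj(image))
        z = torch.sigmoid(self.gate(torch.cat([speech, image], dim=-1)))
        return z * h_s + (1 - z) * h_i        # gated mix of the two modalities

gmu = GatedMultimodalUnit(speech_dim=512, image_dim=1024, hidden_dim=256)  # illustrative sizes
fused = gmu(torch.randn(16, 512), torch.randn(16, 1024))
```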
This paper presents a novel multimodal framework to distinguish between different symptom classes of subjects in the schizophrenia spectrum and healthy controls using audio, video, and text modalities. We implemented Convolutional Neural Network and Long Short-Term Memory based unimodal models and experimented with various multimodal fusion approaches to arrive at the proposed framework. We utilized a minimal Gated Multimodal Unit (mGMU) to obtain a bi-modal intermediate fusion of the features extracted from the input modalities before finally fusing the outputs of the bimodal fusions to perform subject-wise classifications. The use of mGMU units in the multimodal framework improved performance in terms of both weighted F1-score and weighted AUC-ROC.
Pre-trained models generate speech representations that are used in different tasks, including the automatic detection of Parkinson's disease (PD). Although these models can yield high accuracy, their interpretation is still challenging. This paper used a pre-trained Wav2vec 2.0 model to represent speech frames of 25 ms length and perform a frame-by-frame discrimination between PD patients and healthy control (HC) subjects. This fine-grained prediction enabled us to identify specific linguistic segments with high discrimination capability. Speech representations of all produced verbs were compared with those of nouns, and the former yielded higher accuracies. To gain a deeper understanding of this pattern, representations of motor and non-motor verbs were compared, and the former yielded better results, with accuracies of around 83% on an independent test set. These findings support well-established neurocognitive models that highlight action-related language as a key marker of PD.
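A hypothetical sketch of extracting frame-level Wav2vec 2.0 representations for frame-by-frame PD-vs-HC classification; the checkpoint name, placeholder audio, and downstream classifier are assumptions, not the paper's setup.

```python
# Illustrative frame-level feature extraction with a Wav2Vec 2.0 encoder
# (frames with a ~25 ms receptive field at a 20 ms stride).
import torch
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

model_name = "facebook/wav2vec2-base"              # illustrative checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name).eval()

waveform = torch.randn(16000 * 3)                  # 3 s of 16 kHz audio (placeholder)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    frames = model(**inputs).last_hidden_state     # shape: (1, n_frames, 768)
# Each row of `frames` can then be fed to a PD-vs-HC frame classifier.
```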
Head and Neck Cancers (HNC) significantly impact patients' ability to speak, affecting their quality of life. Commonly used metrics for assessing pathological speech are subjective, prompting the need for automated and unbiased evaluation methods. This study proposes a self-supervised Wav2Vec2-based model for phone classification with HNC patients, to enhance accuracy and improve the discrimination of phonetic features for subsequent interpretability purposes. The impact of pre-training datasets, model size, and fine-tuning datasets and parameters is explored. Evaluation on diverse corpora reveals the effectiveness of the Wav2Vec2 architecture, outperforming a CNN-based approach used in previous work. Correlation with perceptual measures also affirms the model's relevance for impaired speech analysis. This work paves the way for a better understanding of pathological speech through interpretable approaches for clinicians, by leveraging complex self-learnt speech representations.
This work explores the potential of Large Language Models (LLMs) as annotators of high-level characteristics of speech transcriptions, which may be relevant for detecting Alzheimer's disease (AD). These low-dimension interpretable features, here designated as macro-descriptors (e.g. text coherence, lexical diversity), are then used to train a binary classifier. Our experiments compared the extraction of these features from both manual and automatic transcriptions obtained with different types of speech recognition systems, and involved both open and closed source LLMs, with several prompting strategies. The experiments also compared the use of macro-descriptors with the direct prediction of AD by the LLM, given the transcription. Even though LLMs are not trained for this task, our experiments show that they achieve up to 81% accuracy, surpassing the baseline of previous AD detection challenges, particularly when used as extractors of macro-descriptors.
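A hypothetical sketch of a prompt asking an LLM to rate interpretable macro-descriptors of a transcription; the descriptor list and rating scale are illustrative, not the exact prompts used in the experiments.

```python
# Illustrative prompt construction for LLM-based macro-descriptor annotation.
MACRO_DESCRIPTORS = ["text coherence", "lexical diversity", "idea density", "repetitiveness"]

def build_prompt(transcription: str) -> str:
    descriptor_lines = "\n".join(f"- {d}: <score 1-5>" for d in MACRO_DESCRIPTORS)
    return (
        "You are given a transcription of a picture-description task.\n"
        f"Transcription:\n{transcription}\n\n"
        "Rate the following characteristics on a 1-5 scale and answer "
        "using exactly this format:\n"
        f"{descriptor_lines}\n"
    )

# The scores parsed from the LLM response form the low-dimensional feature
# vector used to train the binary AD classifier.
print(build_prompt("The boy is on the stool reaching for the cookie jar ..."))
```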
Speech pauses, alongside content and structure, offer a valuable and non-invasive biomarker for detecting dementia. This work investigates the use of pause-enriched transcripts in transformer-based language models to differentiate the cognitive states of subjects with no cognitive impairment, mild cognitive impairment, and Alzheimer's dementia based on their speech from a clinical assessment. We address three binary classification tasks: onset, monitoring, and dementia exclusion. Performance is evaluated through experiments on a German Verbal Fluency Test and a Picture Description Test, comparing the model's effectiveness across different speech production contexts. Starting from a textual baseline, we investigate the effect of incorporating pause information and acoustic context. We show that the test should be chosen depending on the task and, similarly, that lexical pause information and acoustic cross-attention contribute differently.
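A minimal sketch of building a pause-enriched transcript from time-aligned words, assuming illustrative pause thresholds and special tokens (the paper's exact enrichment scheme may differ).

```python
# Illustrative pause enrichment: gaps between aligned words are binned and
# inserted as special tokens before feeding the text to a language model.
def enrich_with_pauses(words, threshold_short=0.5, threshold_long=2.0):
    """words: list of (token, start_sec, end_sec) from a forced alignment."""
    enriched = [words[0][0]]
    for (prev, _, prev_end), (curr, curr_start, _) in zip(words, words[1:]):
        gap = curr_start - prev_end
        if gap >= threshold_long:
            enriched.append("<long_pause>")
        elif gap >= threshold_short:
            enriched.append("<pause>")
        enriched.append(curr)
    return " ".join(enriched)

aligned = [("the", 0.0, 0.2), ("dog", 0.9, 1.2), ("uh", 3.5, 3.7), ("runs", 3.8, 4.1)]
print(enrich_with_pauses(aligned))   # "the <pause> dog <long_pause> uh runs"
```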
Early assessment of mild cognitive impairment (MCI) has the potential to expedite interventions and slow disease progression for people at risk of developing dementia. We investigate the feasibility of administering remote assessments of speech, orofacial and cognitive function to an elderly population with MCI via a cloud-based conversational remote monitoring platform, and the utility of automatically extracted multimodal biomarkers and self-reported problems in identifying MCI patients. We analyzed data from 90 MCI patients and 91 controls who each completed two assessments. 90% of participants reported excellent engagement and liked their overall user experience. Furthermore, combining multiple facial, speech and cognitive markers performed best at distinguishing MCI patients from controls with an AUC of 0.75 using a support vector machine classifier. Finally, we found that MCI patients reported significantly more problems related to memory, falls, anxiety and speech than controls.