Audio quality assessment is critical for evaluating the perceptual realism of sounds. However, the time and expense of obtaining "gold standard" human judgments limit the availability of such data. For AR&VR, good perceived sound quality and localizability of sources are among the key elements needed to ensure complete immersion of the user. Our work introduces SAQAM, which uses a multi-task learning framework to assess listening quality (LQ) and spatialization quality (SQ) between any given pair of binaural signals without using any subjective data. We model LQ by training on a simulated dataset of triplet human judgments, and SQ by utilizing activation-level distances from networks trained for direction of arrival (DOA) estimation. We show that SAQAM correlates well with human responses across four diverse datasets. Since it is a deep network, the metric is differentiable, making it suitable as a loss function for other tasks. For example, simply replacing an existing loss with our metric yields improvement in a speech-enhancement network.
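As an illustration of how an activation-level distance for SQ could be computed, here is a minimal PyTorch sketch; the backbone architecture, layer sizes, and input layout are placeholders and not the authors' implementation.

```python
# Hypothetical sketch: spatialization-quality score as an activation-level
# distance inside a network pretrained for DOA estimation (all names/sizes assumed).
import torch
import torch.nn as nn

class DOABackbone(nn.Module):
    """Stand-in for a DOA-estimation backbone; input is stacked left/right spectra."""
    def __init__(self, n_bins=257):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv1d(2 * n_bins, 256, 3, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv1d(256, 256, 3, padding=1), nn.ReLU()),
        ])

    def activations(self, x):           # x: (batch, 2*n_bins, frames)
        acts = []
        for layer in self.layers:
            x = layer(x)
            acts.append(x)
        return acts

def spatial_quality_distance(doa_net, ref, test):
    """Mean L1 distance between layer activations of two binaural inputs."""
    with torch.no_grad():
        a_ref, a_test = doa_net.activations(ref), doa_net.activations(test)
    return torch.stack([(r - t).abs().mean() for r, t in zip(a_ref, a_test)]).mean()
```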
Human judgments obtained through Mean Opinion Scores (MOS) are the most reliable way to assess the quality of speech signals. However, several recent attempts to automatically estimate MOS using deep learning lack the robustness and generalization capabilities needed for real-world applications. In this work, we present a novel framework, NORESQA-MOS, for estimating the MOS of a speech signal. Unlike prior works, our approach uses non-matching references as a form of conditioning to ground the MOS estimation by neural networks. We show that NORESQA-MOS provides better generalization and more robust MOS estimation than previous state-of-the-art methods such as DNSMOS and NISQA, even though we use a smaller training set. Moreover, we show that our generic framework can be combined with other learning methods such as self-supervised learning, further supplementing the benefits of those methods.
We propose an objective method for measuring pitch extractors' responses to frequency-modulated signals, enabling different pitch extractors to be evaluated with unified criteria. The method uses extended time-stretched pulses combined using binary orthogonal sequences. It simultaneously yields the linear time-invariant, non-linear time-invariant, and random and time-varying components of the response. We tested representative pitch extractors using fundamental frequencies spanning 80 Hz to 800 Hz in 1/48-octave steps and produced more than 2000 modulation frequency response plots. We found that animating these plots as a scientific visualization lets us grasp the behavior of different pitch extractors at once; such efficient and effortless inspection is impossible when examining every plot individually. The proposed measurement method, combined with this visualization, led to further performance improvements in one of the extractors mentioned above. In other words, our procedure turns that pitch extractor into reliable measuring equipment, which is crucial for scientific research. We open-sourced MATLAB code for the proposed objective measurement method and visualization procedure.
Fake audio detection (FAD) is a technique to distinguish synthetic speech from natural speech. In most FAD systems, removing irrelevant features from acoustic speech while keeping only robust discriminative features is essential. Intuitively, speaker information entangled in acoustic speech should be suppressed for the FAD task. In a deep neural network (DNN)-based FAD system in particular, the learning system may learn speaker information from the training dataset and fail to generalize to a testing dataset. In this paper, we propose to use a speaker anonymization (SA) technique to suppress speaker information in acoustic speech before inputting it into a DNN-based FAD system. We adopt the McAdams-coefficient-based SA (MC-SA) algorithm, with the expectation that the entangled speaker information will not be involved in DNN-based FAD learning. Based on this idea, we implemented a light convolutional neural network bidirectional long short-term memory (LCNN-BLSTM)-based FAD system and conducted experiments on the Audio Deep Synthesis Detection Challenge (ADD2022) datasets. The results show that removing speaker information from acoustic speech yields a relative performance improvement of 17.66% in the first track of ADD2022.
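For readers unfamiliar with the McAdams transformation, the following is a minimal sketch of McAdams-coefficient-based anonymization: the angles of the LPC poles of each frame are raised to a power alpha before resynthesis, shifting formant positions. The frame sizes, the value of alpha, and the simple overlap-add (without gain normalization) are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch of McAdams-coefficient speaker anonymization (MC-SA).
import numpy as np
import librosa
from scipy.signal import lfilter

def mcadams_anonymize(y, sr, alpha=0.8, order=16, frame_len=1024, hop=512):
    out = np.zeros(len(y))
    win = np.hanning(frame_len)
    for start in range(0, len(y) - frame_len, hop):
        frame = y[start:start + frame_len] * win
        a = librosa.lpc(frame, order=order)            # LPC coefficients A(z)
        residual = lfilter(a, [1.0], frame)            # inverse filtering -> residual
        poles = np.roots(a)
        # McAdams transform: raise the angle of each complex pole to the power alpha.
        new_poles = np.array([
            np.abs(p) * np.exp(1j * np.sign(np.angle(p)) * np.abs(np.angle(p)) ** alpha)
            if np.iscomplex(p) else p
            for p in poles
        ])
        a_new = np.real(np.poly(new_poles))            # modified all-pole filter
        out[start:start + frame_len] += lfilter([1.0], a_new, residual) * win
    return out                                         # overlap-add, gain not normalized
```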
Contrastive learning enables learning useful audio and speech representations without ground-truth labels by maximizing the similarity between latent representations of similar signal segments. In this framework, various data augmentation techniques are usually exploited to enforce desired invariances within the learned representations, improving performance on various audio tasks thanks to more robust embeddings. Selecting the most relevant augmentations has proven crucial for better downstream performance. This work therefore introduces a conditional independence-based method that automatically selects a suitable distribution over a set of predefined augmentations and their parametrization for contrastive self-supervised pre-training. The selection is performed with respect to a downstream task of interest, hence saving a costly hyper-parameter search. Experiments on two different downstream tasks validate the proposed approach, showing better results than training without augmentation or with baseline augmentations. We furthermore conduct a qualitative analysis of the automatically selected augmentations and how they vary with the final downstream dataset.
In this paper, we propose a computational measure of audio quality in user-generated multimedia (UGM) that is consistent with the human perceptual system. To this end, we first extend the previously proposed IIT-JMU-UGM Audio dataset by including samples with more diverse context, content, distortion types, and intensities, along with implicitly distorted audio that reflects realistic scenarios. We conduct subjective testing on the extended database of 2075 audio clips to obtain a mean opinion score for each sample. We then introduce transformer-based learning to the domain of audio quality assessment, training the model on three key audio features: Mel-frequency cepstral coefficients, chroma, and the Mel-scaled spectrogram. The proposed non-intrusive transformer-based model is compared against state-of-the-art methods and outperforms Simple RNN, LSTM, and GRU models by over 4%. The database and the source code will be made public upon acceptance.
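The three input features can be extracted with librosa as in the sketch below; the sampling rate and feature dimensions are assumptions, not necessarily the settings used in the paper.

```python
# Sketch of the MFCC, chroma, and Mel-spectrogram inputs for a quality model.
import librosa
import numpy as np

def extract_features(path, sr=16000, n_mfcc=20, n_mels=64):
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)            # (n_mfcc, T)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)                  # (12, T)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))    # (n_mels, T)
    return np.concatenate([mfcc, chroma, mel], axis=0)                # stacked frame-wise features
```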
Speech representations that are robust to pathology-unrelated cues such as speaker identity information have been shown to be advantageous for automatic dysarthric speech classification. A recently proposed technique for learning speaker identity-invariant representations for dysarthric speech classification is based on adversarial training. However, adversarial training can be challenging, unstable, and sensitive to training parameters. To avoid adversarial training, in this paper we propose to learn speaker identity-invariant representations using a feature separation framework relying on mutual information minimization. Experimental results on a database of neurotypical and dysarthric speech show that the proposed adversarial-free framework successfully learns speaker identity-invariant representations. Further, such representations result in dysarthric speech classification performance similar to that obtained with adversarial training, while the training procedure is more stable and less sensitive to training parameters.
Wilson's disease (WD), a rare genetic movement disorder, is characterized by early-onset dysarthria, so automated speech assessment is valuable for early diagnosis and intervention. Time-frequency features such as Mel-frequency cepstral coefficients (MFCCs) have frequently been used. However, human speech signals are nonlinear and nonstationary, which cannot be captured by traditional features based on the Fourier transform. Moreover, the dysarthria of WD patients is complex and differs from that of other movement disorders such as Parkinson's disease. Sensitive time-frequency measures for WD patients are therefore needed. The present study proposes DMFCC, an improvement of MFCC based on signal decomposition. We validate the usefulness of DMFCC for WD detection with a sample of 60 WD patients and 60 matched healthy controls. Results show that DMFCC achieves the best classification accuracy (86.1%), improving by 13.9%-44.4% over baseline features such as MFCC and the state-of-the-art Hilbert cepstral coefficients (HCCs). The present study is a first attempt to demonstrate the validity of automated acoustic measures for WD detection, and the proposed DMFCC provides a novel tool for speech assessment.
Congenital amusia is a neurogenetic disorder affecting music pitch processing. It also transfers to the language domain and negatively influences the perception of linguistic components that rely on pitch, such as lexical tones. It is well established that unfavorable listening conditions affect lexical tone perception in amusics; for instance, both Mandarin- and Cantonese-speaking amusics were impaired in tone processing under simultaneous noise. Backward noise is another adverse listening condition, but its interference mechanism is distinct from that of simultaneous noise, warranting further study of whether and how backward masking noise affects tone processing in amusics. In the current study, 18 Mandarin-speaking amusics and 18 controls were tested on discrimination of Mandarin tones under two conditions: a quiet condition involving relatively low-level processing and a backward masking condition involving high-level processing (e.g., tone categorization), in which native multi-talker babble noise was added to the target tones. The results revealed that amusics performed similarly to controls in the quiet condition but more poorly in the backward noise condition. These findings shed light on how adverse listening environments influence amusics' lexical tone processing and provide further empirical evidence that amusics may be impaired in the high-level phonological processing of lexical tones.
Dementia is a severe cognitive impairment that affects the health of older adults and creates a burden on their families and caretakers. This paper analyzes diverse features extracted from spoken language and selects the most discriminative features for dementia detection. The paper presents a deep learning-based feature ranking method called dual-net feature ranking (DFR). The proposed DFR utilizes a dual-net architecture, in which two networks (called the operator and the selector) are alternately and cooperatively trained to perform feature selection and dementia detection simultaneously. The DFR interprets the contribution of individual features to the predictions of the selector network using all of the selector's parameters. The DFR was evaluated on the Cantonese JCCOCC-MoCA Elderly Speech Dataset. Results show that the DFR can significantly reduce feature dimensionality while identifying small feature subsets with performance comparable or superior to the whole feature set. The selected features have been uploaded to https://github.com/kexquan/AD-detection-Feature-selection.
In contrast to previous studies that only discriminate pathological voice from normal voice, in this study we focus on the discrimination between cases of spasmodic dysphonia (SD) and vocal fold palsy (VP) using automated analysis of speech recordings. The hypothesis is that discrimination will be enhanced by studying continuous speech, since the different pathologies are likely to have different effects in different phonetic contexts. We collected audio recordings of isolated vowels and of a read passage from 60 patients diagnosed with SD (N=38) or VP (N=22). Baseline classifiers on features extracted from the recordings taken as a whole gave a cross-validated unweighted average recall of up to 75% for discriminating the two pathologies. We then used an automated method to divide the read passage into phone-labelled regions and built classifiers for each phone. Results show that the discriminability of the pathologies varied with phonetic context, as predicted. Since different phone contexts provide different information about the pathologies, classification is improved by fusing the phone-level predictions, achieving a classification accuracy of 83%. The work has implications for the differential diagnosis of voice pathologies and contributes to a better understanding of their impact on speech.
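One simple way to fuse per-phone classifier outputs, for illustration only (the paper's exact fusion rule may differ), is a count-weighted average of the phone-level posteriors:

```python
# Toy fusion of per-phone posteriors for one recording; values are hypothetical.
import numpy as np

def fuse_phone_predictions(phone_posteriors, phone_counts):
    """
    phone_posteriors: dict phone -> array (n_classes,), e.g. [P(SD), P(VP)]
    phone_counts:     dict phone -> number of segments of that phone in the passage
    """
    total = sum(phone_counts[p] for p in phone_posteriors)
    fused = sum(phone_counts[p] * phone_posteriors[p] for p in phone_posteriors) / total
    return fused                         # np.argmax(fused) gives the predicted pathology

post = {"aa": np.array([0.7, 0.3]), "s": np.array([0.4, 0.6])}   # hypothetical posteriors
counts = {"aa": 5, "s": 3}
print(fuse_phone_predictions(post, counts))
```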
This work presents an outer product-based approach to fuse the embedded representations learnt from the spectrograms of cough, breath, and speech samples for the automatic detection of COVID-19. To extract deep learnt representations from the spectrograms, we compare the performance of specific Convolutional Neural Networks (CNNs) trained from scratch and ResNet18-based CNNs fine-tuned for the task at hand. Furthermore, we investigate whether the patients' sex and the use of contextual attention mechanisms are beneficial. Our experiments use the dataset released as part of the Second Diagnosing COVID-19 using Acoustics (DiCOVA) Challenge. The results suggest the suitability of fusing breath and speech information to detect COVID-19. An Area Under the Curve (AUC) of 84.06 % is obtained on the test partition when using specific CNNs trained from scratch with contextual attention mechanisms. When using ResNet18-based CNNs for feature extraction, the baseline model scores the highest performance with an AUC of 84.26 %.
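For illustration, an outer-product fusion of two modality embeddings (e.g., breath and speech) can be sketched as follows; the embedding sizes and classifier head are assumptions rather than the paper's configuration.

```python
# Minimal sketch of outer-product fusion of two embedded representations.
import torch
import torch.nn as nn

class OuterProductFusion(nn.Module):
    def __init__(self, d_a=128, d_b=128, n_classes=2):
        super().__init__()
        self.classifier = nn.Linear(d_a * d_b, n_classes)

    def forward(self, emb_a, emb_b):
        # (B, d_a) x (B, d_b) -> (B, d_a, d_b) bilinear interaction matrix
        outer = torch.einsum("bi,bj->bij", emb_a, emb_b)
        return self.classifier(outer.flatten(start_dim=1))

fusion = OuterProductFusion()
logits = fusion(torch.randn(4, 128), torch.randn(4, 128))   # (4, 2)
```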
A fast, efficient, and accurate detection method for COVID-19 remains a critical challenge. Many cough-based COVID-19 detection studies have shown competitive results using artificial intelligence. However, the lack of analysis of the vocalization characteristics of cough sounds limits further improvement of detection performance. In this paper, we propose two novel acoustic features of cough sounds and a convolutional neural network structure for COVID-19 detection. First, a time-frequency differential feature is proposed to characterize the dynamic information of cough sounds in the time and frequency domains. Then, an energy ratio feature is proposed to capture the energy differences caused by phonation characteristics in different cough phases. Finally, a convolutional neural network with two parallel branches, pre-trained on a large amount of unlabeled cough data, is proposed for classification. Experimental results show that our proposed method achieves state-of-the-art performance on the Coswara dataset for COVID-19 detection. Results on an external clinical dataset, Virufy, also show the good generalization ability of our proposed method.
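A hedged sketch of what such features might look like is given below: first-order differences of a log-mel spectrogram along the time and frequency axes, and an energy ratio between two cough phases. The paper's exact formulations may differ.

```python
# Illustrative time-frequency differential and energy-ratio features for cough sounds.
import numpy as np
import librosa

def tf_differential(y, sr, n_mels=64):
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
    d_time = np.diff(mel, axis=1, prepend=mel[:, :1])   # dynamics over time
    d_freq = np.diff(mel, axis=0, prepend=mel[:1, :])   # dynamics over frequency
    return np.stack([mel, d_time, d_freq])              # (3, n_mels, T) CNN input

def energy_ratio(y, split):
    """Energy of the segment before vs. after an assumed phase boundary (sample index)."""
    e1, e2 = np.sum(y[:split] ** 2), np.sum(y[split:] ** 2)
    return e1 / (e2 + 1e-10)
```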
The present study investigates the use of 1-dimensional (1-D) and 2-dimensional (2-D) spectral feature representations in voice pathology detection with several classical machine learning (ML) and recent deep learning (DL) classifiers. Four widely used spectral feature representations (static mel-frequency cepstral coefficients (MFCCs), dynamic MFCCs, spectrogram, and mel-spectrogram) are derived in both 1-D and 2-D form from voice signals. Three widely used ML classifiers (support vector machine (SVM), random forest (RF), and AdaBoost) and three DL classifiers (deep neural network (DNN), long short-term memory (LSTM) network, and convolutional neural network (CNN)) are used with the 1-D feature representations. In addition, CNN classifiers are built using the 2-D feature representations. The widely used HUPA database is considered in the pathology detection experiments. Experimental results revealed that using the CNN classifier with the 2-D feature representations yielded better accuracy compared to using the ML and DL classifiers with the 1-D feature representations. The best performance was achieved by the 2-D CNN classifier based on dynamic MFCCs, which showed a detection accuracy of 81%.
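The distinction between 1-D and 2-D representations of the same MFCC-based material can be illustrated as follows; the parameters and the choice of summary statistics are illustrative, not the study's exact settings.

```python
# 1-D: per-utterance statistics vector (for SVM/RF/DNN); 2-D: time-frequency matrix (for CNN).
import numpy as np
import librosa

def mfcc_1d(y, sr, n_mfcc=13):
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)          # (n_mfcc, T)
    return np.concatenate([m.mean(axis=1), m.std(axis=1)])       # (2*n_mfcc,) vector

def mfcc_2d(y, sr, n_mfcc=13):
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    d = librosa.feature.delta(m)                                 # dynamic MFCCs
    return np.stack([m, d])                                      # (2, n_mfcc, T) image-like input
```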
Aphasia is a common speech and language disorder, typically caused by a brain injury or a stroke, that affects millions of people worldwide. Detecting and assessing aphasia in patients is a difficult, time-consuming process, and numerous attempts to automate it have been made, the most successful using machine learning models trained on aphasic speech data. As in many medical applications, aphasic speech data is scarce, and the problem is exacerbated in so-called "low-resource" languages, which for this task include most languages other than English. We attempt to leverage available data in English and achieve zero-shot aphasia detection in low-resource languages such as Greek and French by using language-agnostic linguistic features. Current cross-lingual aphasia detection approaches rely on manually extracted transcripts. We propose an end-to-end pipeline using pre-trained Automatic Speech Recognition (ASR) models that share cross-lingual speech representations and are fine-tuned for our target low-resource languages. To further boost the ASR model's performance, we also combine it with a language model. We show that our ASR-based end-to-end pipeline offers results comparable to previous setups using human-annotated transcripts.
Detecting dementia from human speech is promising but faces the challenge of limited data. While recent research has shown that general pretrained models (e.g., BERT) can improve dementia detection, such a model can hardly be fine-tuned on the small available dementia dataset without overfitting. In this paper, we propose domain-aware intermediate pretraining, which enables pretraining on a domain-similar dataset selected by incorporating knowledge from the dementia dataset. Specifically, we use pseudo-perplexity to find an effective pretraining dataset, and then propose dataset-level and sample-level domain-aware intermediate pretraining techniques. We further employ information units (IU) from previous dementia research and define an IU-pseudo-perplexity to reduce computational complexity. We confirm the effectiveness of perplexity by showing a strong correlation between perplexity and accuracy using nine datasets and models from the GLUE benchmark. We show that our domain-aware intermediate pretraining improves detection accuracy in almost all cases. Our results suggest that the difference in text-based perplexity values between patients with Alzheimer's Dementia and healthy controls is still small, and that perplexity incorporating acoustic features (e.g., pauses) may make the pretraining more effective.
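Pseudo-perplexity for a masked language model can be computed by masking one token at a time and scoring the true token, as in this sketch; the model name is only an example, and the IU-based variant is not shown.

```python
# Sketch of masked-LM pseudo-perplexity scoring for a single sentence.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_perplexity(text):
    ids = tok(text, return_tensors="pt")["input_ids"][0]
    log_probs = []
    for i in range(1, len(ids) - 1):                 # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        log_probs.append(torch.log_softmax(logits, dim=-1)[ids[i]])
    return torch.exp(-torch.stack(log_probs).mean()).item()

print(pseudo_perplexity("the patient described the picture slowly"))  # hypothetical input
```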
Altered quality of the phonetic-acoustic information in the speech signal in motor speech disorders may reduce intelligibility. Monitoring intelligibility is part of the standard clinical assessment of patients and is also a valuable tool for indexing the evolution of the speech disorder. However, measuring intelligibility raises methodological debates concerning the type of linguistic material on which the assessment is based (non-words, words, continuous speech), the evaluation protocol and type of scores (scale-based rating, transcription, or recognition tests), and the advantages and disadvantages of listener-based vs. automatic approaches (subjective vs. objective, expertise level, types of models used). In this paper, the intelligibility of the speech of 32 French patients presenting mild to moderate dysarthria and 17 elderly speakers is assessed with five different methods: impressionistic clinician judgment on continuous speech, the number of words recognized in an interactive face-to-face setting and in an online test of the same material by 75 judges, and automatic feature-based and automatic speech recognition-based methods (both on short sentences). The implications of the different methods for clinical practice are discussed.
Acoustic realisation of the working vowel space has been widely studied in Parkinson's disease (PD). However, it has never been studied in atypical parkinsonian disorders (APD), neurodegenerative diseases which share clinical features with PD and render the differential diagnosis very challenging in early disease stages. This paper presents the first contribution to vowel space analysis in APD, comparing corner vowel realisation in PD and the parkinsonian variant of Multiple System Atrophy (MSA-P). Our study has the particularity of focusing exclusively on early-stage PD and MSA-P patients, as our main purpose was early differential diagnosis between these two diseases. We analysed the corner vowels, extracted from a spoken sentence, using traditional vowel space metrics. We found no statistical difference between the PD group and healthy controls (HC), whereas MSA-P exhibited significant differences from both the PD and HC groups. We also found that some metrics conveyed complementary discriminative information. Consequently, we argue that restriction in the acoustic realisation of corner vowels cannot be a viable early marker of PD, as hypothesised by some studies, but it may be a candidate early hypokinetic marker of MSA-P (when the clinical target is discrimination between PD and MSA-P).
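One traditional vowel space metric, the vowel space area in the F1-F2 plane, can be computed with the shoelace formula as sketched below; the corner-vowel formant values are purely illustrative and not taken from the study.

```python
# Vowel space area from corner-vowel formants (shoelace formula).
import numpy as np

def vowel_space_area(formants):
    """formants: list of (F1, F2) pairs for the corner vowels, ordered around the polygon."""
    f1 = np.array([p[0] for p in formants])
    f2 = np.array([p[1] for p in formants])
    return 0.5 * np.abs(np.dot(f1, np.roll(f2, -1)) - np.dot(f2, np.roll(f1, -1)))

# Three hypothetical corner vowels (F1, F2) in Hz:
print(vowel_space_area([(300, 2300), (750, 1300), (350, 800)]))
```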
Cordectomized or laryngectomized patients recover the ability to speak thanks to devices able to produce a natural-sounding voice source in real time. However, constant voicing can impair the naturalness and intelligibility of the reconstructed speech. Voicing decision, i.e., identifying whether an uttered phone should be voiced or not, is investigated here as an automatic process in the context of whisper-to-speech (W2S) conversion systems. Whereas state-of-the-art approaches apply DNN techniques to high-dimensional acoustic features, we seek a low-resource alternative: a perceptually meaningful mapping between acoustic features and voicing decision suitable for real-time applications. Our method first classifies whisper signal frames into phoneme classes based on their spectral centroid and spread, and then discriminates voiced phonemes from their unvoiced counterparts using class-dependent spectral centroid thresholds. We compared our method to a simpler approach using a single centroid threshold on several databases of annotated whispers in both single-speaker and multi-speaker training setups. While both approaches reach voicing accuracy higher than 91%, the proposed method avoids some systematic voicing decision errors, which may allow users to learn to adapt their speech in real time to compensate for the remaining voicing errors.
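A hedged sketch of this two-stage decision (frame-wise spectral centroid and spread, a coarse class assignment, then a class-dependent centroid threshold) is given below; the class definitions and threshold values are placeholders, not those of the paper.

```python
# Illustrative class-dependent centroid-threshold voicing decision for whisper frames.
import numpy as np

def centroid_and_spread(frame, sr, n_fft=512):
    spec = np.abs(np.fft.rfft(frame, n_fft)) + 1e-10
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    centroid = np.sum(freqs * spec) / np.sum(spec)
    spread = np.sqrt(np.sum(((freqs - centroid) ** 2) * spec) / np.sum(spec))
    return centroid, spread

def voicing_decision(frame, sr, class_thresholds):
    c, s = centroid_and_spread(frame, sr)
    # Toy class assignment from (centroid, spread); real classes are phoneme-based.
    phone_class = "fricative-like" if c > 3000 and s > 1500 else "vowel-like"
    return c < class_thresholds[phone_class]        # True -> voiced

thresholds = {"vowel-like": 2000.0, "fricative-like": 4500.0}   # hypothetical values (Hz)
```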
Parkinson's disease (PD) is characterized by motor dysfunction; however, non-motor symptoms such as cognitive decline also have a dramatic impact on quality of life. Current assessments for diagnosing cognitive impairment take many hours and require high clinician involvement. Thus, there is a need for new tools that quickly and accurately determine cognitive impairment and allow for appropriate, timely interventions. In this paper, individuals with PD, designated as having either no cognitive impairment (NCI) or mild cognitive impairment (MCI), undergo a speech-based protocol involving reading or listing items within a category, performed either with or without a concurrent drawing task. From the speech recordings, we extract motor coordination-based features derived from correlations across acoustic features representative of speech production subsystems. The correlation-based features are used in Gaussian mixture models to discriminate between individuals designated NCI or MCI in both the single- and dual-task paradigms. Features derived from the laryngeal and respiratory subsystems, in particular, discriminate between these two groups with AUCs > 0.80. These results suggest that cognitive impairment can be detected using speech from both single- and dual-task paradigms, and that cognitive impairment may manifest as differences in vocal fold vibration stability.
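For illustration, discriminating the two groups with per-class Gaussian mixture models over feature vectors can be sketched as follows; the feature vectors here are random placeholders rather than the paper's correlation-based features.

```python
# Per-class GMM scoring for NCI vs. MCI discrimination (toy data).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_nci = rng.normal(0.0, 1.0, (40, 16))     # placeholder NCI feature vectors
X_mci = rng.normal(0.5, 1.0, (40, 16))     # placeholder MCI feature vectors

gmm_nci = GaussianMixture(n_components=2, covariance_type="diag").fit(X_nci)
gmm_mci = GaussianMixture(n_components=2, covariance_type="diag").fit(X_mci)

def score(x):
    """Log-likelihood ratio; positive values favour MCI."""
    return gmm_mci.score_samples(x[None]) - gmm_nci.score_samples(x[None])

print(score(rng.normal(0.5, 1.0, 16)))
```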
Acoustic analysis plays an important role in the assessment of dysarthria. Out of public health necessity, telepractice has become an increasingly adopted modality for delivering clinical care. While telepractice platforms differ in software, they all use some form of speech compression to preserve bandwidth, with the most common algorithm being the Opus codec. Opus has been optimized for compressing speech from the general (mostly healthy) population. For speech-language pathologists, this begs the question: is the remotely transmitted speech signal a faithful representation of dysarthric speech? Existing high-fidelity audio recordings from 20 speakers of various dysarthria types were encoded at three different bit rates defined within Opus to simulate different internet bandwidth conditions. Acoustic measures of articulation, voice, and prosody were extracted, and mixed-effects models were used to evaluate the impact of bandwidth conditions on the measures. Significant differences in cepstral peak prominence, degree of voice breaks, jitter, pitch, and vowel space area were observed after Opus processing, providing insight into the types of acoustic measures that are susceptible to speech compression algorithms.
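Such bandwidth conditions can be simulated by round-tripping recordings through the Opus codec, for example with ffmpeg as sketched below; the file names and bit rates are examples, not necessarily those used in the study.

```python
# Encode a wav file with Opus at a given bit rate and decode it back to wav.
import subprocess

def opus_roundtrip(wav_in, wav_out, bitrate="16k"):
    tmp = wav_out + ".opus"
    subprocess.run(["ffmpeg", "-y", "-i", wav_in, "-c:a", "libopus",
                    "-b:a", bitrate, tmp], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", tmp, wav_out], check=True)

for br in ["6k", "16k", "64k"]:                      # example bit rates only
    opus_roundtrip("speaker01.wav", f"speaker01_{br}.wav", br)   # hypothetical files
```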
Despite significant progress in areas such as speech recognition, cochlear implant users still experience challenges in identifying speaker traits such as gender, age, emotion, and accent. In this study, we focus on emotion as one such trait. We propose the use of emotion intensity conversion to perceptually enhance emotional speech, with the goal of improving speech emotion recognition for cochlear implant users. To this end, we utilize a parallel speech dataset containing emotion and intensity labels to perform conversion from normal- to high-intensity emotional speech. A non-negative matrix factorization method is used to perform emotion intensity conversion via spectral mapping. We evaluate our emotional speech enhancement using a support vector machine model for emotion recognition. In addition, we conduct a listener experiment on emotion recognition with normal-hearing listeners using vocoded audio. The results suggest that such enhancement will benefit speaker trait perception for cochlear implant users.
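A minimal sketch of NMF-based spectral mapping with coupled dictionaries learned from time-aligned parallel spectrograms is given below; this is a generic exemplar-style formulation and the paper's method may differ in detail.

```python
# Coupled-dictionary NMF spectral mapping: learn shared activations on stacked
# normal/high-intensity spectrograms, then decode new normal speech with the
# high-intensity dictionary.
import numpy as np
from sklearn.decomposition import NMF
from scipy.optimize import nnls

def learn_coupled_dictionaries(S_normal, S_high, k=40):
    """S_*: magnitude spectrograms (freq, frames), frame-aligned."""
    joint = np.vstack([S_normal, S_high])            # stacking enforces shared activations
    model = NMF(n_components=k, max_iter=400, init="nndsvda")
    W = model.fit_transform(joint.T).T               # unused here; activations (k, frames)
    D = model.components_.T                          # joint dictionary (2*freq, k)
    n_freq = S_normal.shape[0]
    return D[:n_freq], D[n_freq:]                    # D_normal, D_high

def convert(S_test_normal, D_normal, D_high):
    """Estimate activations on the normal dictionary, decode with the high-intensity one."""
    H = np.stack([nnls(D_normal, frame)[0] for frame in S_test_normal.T], axis=1)
    return D_high @ H                                # converted magnitude spectrogram
```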
Medialization thyroplasty is a frequently used surgical treatment for insufficient glottal closure and involves placement of an implant to medialize the vocal fold. Prior studies have been unable to determine the optimal implant shape and stiffness. In this study, thyroplasty implants with various medial surface shapes (rectangular, convergent, or divergent) and stiffnesses (Silastic, Gore-Tex, soft silicone of varying stiffness, or hydrogel) were assessed for optimal voice quality in an in vivo canine model of unilateral vocal fold paralysis, with graded contralateral neuromuscular stimulation to mimic the compensation expected in patients with this laryngeal pathology. Across experiments, Silastic rectangular implants consistently resulted in an improved voice quality metric, indicating high-quality output phonation. These findings have clinical implications for optimizing thyroplasty implant treatment for speakers with laryngeal pathologies causing glottic insufficiency.
When using two-dimensional convolutional neural networks (2D-CNNs) in image processing, domain information can be manipulated through channel statistics, and instance normalization has been a promising way to obtain domain-invariant features. Unlike in image processing, we find that the domain-relevant information in an audio feature is dominant in frequency statistics rather than channel statistics. Motivated by this analysis, we introduce Relaxed Instance Frequency-wise Normalization (RFN): a plug-and-play, explicit normalization module along the frequency axis that can eliminate instance-specific domain discrepancy in an audio feature while relaxing the undesirable loss of useful discriminative information. Empirically, simply adding RFN to networks yields clear margins over previous domain generalization approaches on acoustic scene classification and improved robustness across multiple audio devices. In particular, the proposed RFN won DCASE2021 Challenge Task 1A, low-complexity acoustic scene classification with multiple devices, by a clear margin, and this paper is an extended version of that work.
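A minimal sketch of frequency-wise instance normalization with a relaxation factor is given below; the blending formulation and the fixed relaxation value are assumptions and may differ from the paper's exact RFN.

```python
# Sketch: normalize each frequency bin over channels and time, then blend with
# the unnormalized input via a relaxation factor lam.
import torch
import torch.nn as nn

class RFN(nn.Module):
    def __init__(self, lam=0.5, eps=1e-5):
        super().__init__()
        self.lam, self.eps = lam, eps

    def forward(self, x):                      # x: (batch, channels, freq, time)
        mean = x.mean(dim=(1, 3), keepdim=True)
        var = x.var(dim=(1, 3), keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return self.lam * x + (1.0 - self.lam) * x_norm   # relaxed normalization

rfn = RFN()
out = rfn(torch.randn(8, 32, 64, 128))         # drop-in module for a 2D-CNN feature map
```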
The recently proposed Mean Teacher method, which exploits large-scale unlabeled data in a self-ensembling manner, has achieved state-of-the-art results in several semi-supervised learning benchmarks. Spurred by these achievements, this paper proposes an effective Couple Learning method that combines a well-trained model with a Mean Teacher model. The proposed pseudo-label generation model (PLG) increases the amount of strongly- and weakly-labeled data to improve the Mean Teacher method's performance, while the Mean Teacher's consistency cost reduces the impact of noise introduced into the pseudo-labels by detection errors. Experimental results on Task 4 of the DCASE2020 challenge demonstrate the superiority of the proposed method, which achieves about 44.25% F1-score on the validation set without post-processing, significantly outperforming the baseline system's 32.39%. Furthermore, this paper also proposes a simple and effective experiment called the Variable Order Input (VOI) experiment, which demonstrates the significance of the Couple Learning method. Our Couple Learning code is available on GitHub.
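A hedged sketch of the pseudo-label generation step is given below; the model interface, output shapes, and thresholds are assumptions rather than the paper's implementation.

```python
# Toy pseudo-label generation from a well-trained sound event detection model.
import torch

def generate_pseudo_labels(plg_model, unlabeled_batch, weak_thr=0.7, strong_thr=0.6):
    with torch.no_grad():
        strong_prob, weak_prob = plg_model(unlabeled_batch)   # assumed (B, T, C), (B, C)
    weak_labels = (weak_prob > weak_thr).float()              # clip-level pseudo-labels
    # Keep frame-level labels only for classes accepted at clip level.
    strong_labels = (strong_prob > strong_thr).float() * weak_labels.unsqueeze(1)
    return strong_labels, weak_labels                         # added to Mean Teacher training
```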