The Variational Autoencoder (VAE) is a powerful deep generative model that is now extensively used to represent high-dimensional complex data via a low-dimensional latent space learned in an unsupervised manner. In the original VAE model, input data vectors are processed independently. In recent years, a series of papers have presented different extensions of the VAE for sequential data that, relying on recurrent neural networks, model not only the latent space but also the temporal dependencies within a sequence of data vectors and corresponding latent vectors. We recently performed a comprehensive review of those models and unified them into a general class called Dynamical Variational Autoencoders (DVAEs). In the present paper, we report the results of an experimental benchmark comparing six of those DVAE models on the speech analysis-resynthesis task, as an illustration of the high potential of DVAEs for speech modeling.
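To make the DVAE idea concrete, here is a minimal PyTorch sketch (not one of the six benchmarked models) in which a recurrent state carries the temporal dependencies between data vectors and latent vectors; all layer sizes, names, and the loss weighting are assumptions.

```python
# Minimal DVAE-style sketch (illustrative only, not one of the six benchmarked models).
import torch
import torch.nn as nn

class TinyDVAE(nn.Module):
    def __init__(self, x_dim=64, z_dim=16, h_dim=128):
        super().__init__()
        self.rnn = nn.GRUCell(x_dim + z_dim, h_dim)      # carries temporal dependencies
        self.enc = nn.Linear(h_dim + x_dim, 2 * z_dim)   # -> mean and log-variance of q(z_t | ...)
        self.dec = nn.Linear(h_dim + z_dim, x_dim)       # -> reconstruction of x_t

    def forward(self, x):                                # x: (T, B, x_dim)
        T, B, _ = x.shape
        h = x.new_zeros(B, self.rnn.hidden_size)
        recons, kls = [], []
        for t in range(T):
            stats = self.enc(torch.cat([h, x[t]], dim=-1))
            mu, logvar = stats.chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)          # reparameterization
            recons.append(self.dec(torch.cat([h, z], dim=-1)))
            kls.append(-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1))
            h = self.rnn(torch.cat([x[t], z], dim=-1), h)                    # update temporal state
        return torch.stack(recons), torch.stack(kls)

x = torch.randn(50, 8, 64)                    # 50 frames, batch of 8, 64-dim features
recon, kl = TinyDVAE()(x)
loss = ((recon - x) ** 2).mean() + kl.mean()  # ELBO-style objective (Gaussian likelihood, unit weight)
```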
Accurate phoneme detection and processing can enhance speech intelligibility in hearing aids and in audio and speech codecs. As fricative phonemes have an important part of their energy concentrated in high frequency bands, frequency lowering algorithms are used in hearing aids to improve fricative intelligibility for people with high-frequency hearing loss. In traditional audio codecs, which process speech in blocks, spectral smearing around fricative phoneme borders results in pre- and post-echo artifacts. Hence, detecting the fricative borders and adapting the processing accordingly could enhance the quality of speech. Until recently, phoneme detection and analysis were mostly done by extracting features specific to the class of phonemes. In this paper, we present a deep learning based fricative phoneme detection algorithm that exceeds the state-of-the-art fricative phoneme detection accuracy on the TIMIT speech corpus. Moreover, we compare our method to other approaches that employ classical signal processing for fricative detection, and we also evaluate it on TIMIT files coded with the AAC codec followed by bandwidth limitation. The reported results of our deep learning approach on the original TIMIT files are reproducible and come with easy-to-use code that could serve as a baseline for future research on this topic.
Formants are major resonances in the vocal tract system. Identification of formants is important for the study of speech. In the literature, formants are typically identified by first deriving formant frequency candidates (e.g., using linear prediction) and then applying a tracking mechanism. In this paper, we propose a simple tracking-free formant identification approach based on zero frequency filtering. More precisely, formants F1-F2 are identified by modifying the trend removal operation in zero frequency filtering and simply picking the dominant peak in the short-term discrete Fourier transform spectra. We demonstrate the potential of the approach by comparing it against state-of-the-art formant identification approaches on a typical speech data set (TIMIT-VTR) and an atypical speech data set (PC-GITA).
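As a toy illustration of the peak-picking step only (the zero frequency filtering and modified trend removal of the actual method are omitted), the following numpy sketch picks the dominant short-term DFT peak within assumed F1 and F2 search ranges; the ranges, frame length, and sampling rate are all placeholders.

```python
# Toy sketch: dominant-peak picking in a short-term DFT spectrum (the ZFF and trend-removal
# stages of the actual method are not reproduced here).
import numpy as np

def dominant_peak(frame, fs, fmin, fmax, nfft=2048):
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), nfft))
    freqs = np.fft.rfftfreq(nfft, 1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)
    return freqs[band][np.argmax(spec[band])]     # frequency of the strongest peak in the band

fs = 16000
frame = np.random.randn(400)                      # stand-in for one 25 ms speech frame
f1 = dominant_peak(frame, fs, 200, 900)           # assumed F1 search range
f2 = dominant_peak(frame, fs, 900, 2800)          # assumed F2 search range
print(f1, f2)
```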
Phoneme-to-audio alignment is the task of synchronizing voice recordings and their related phonetic transcripts. In this work, we introduce a new system for forced phonetic alignment with Recurrent Neural Networks (RNN). With the Connectionist Temporal Classification (CTC) loss as the training objective, and an additional reconstruction cost, we learn to infer relevant per-frame phoneme probabilities from which the alignment is derived. The core of the neural architecture is a context-aware attention mechanism between mel-spectrograms and side information. We investigate two contexts, given by either phoneme sequences (model PhAtt) or the spectrograms themselves (model SpAtt). Evaluations show that these models produce precise alignments for both speaking and singing voice. Best results are obtained with the model PhAtt, which outperforms the baseline reference with an average imprecision of 16.3 ms and 29.8 ms on speech and singing, respectively. The model SpAtt also appears to be an interesting alternative, capable of aligning longer audio files without requiring phoneme sequences on small audio segments.
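For illustration, one simple way to turn per-frame phoneme probabilities into an alignment is monotonic dynamic-programming forced alignment of a known phoneme sequence against the posterior matrix; the sketch below uses dummy posteriors and ignores CTC blank handling, so it is not the system's actual decoder.

```python
# Sketch: forced alignment of a known phoneme sequence to per-frame posteriors
# via dynamic programming (monotonic, no CTC blank handling).
import numpy as np

def forced_align(log_probs, phone_ids):
    """log_probs: (T, V) frame-level log posteriors; phone_ids: target phoneme sequence."""
    T, S = log_probs.shape[0], len(phone_ids)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_probs[0, phone_ids[0]]
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]
            move = score[t - 1, s - 1] if s > 0 else -np.inf
            back[t, s] = 0 if stay >= move else 1
            score[t, s] = max(stay, move) + log_probs[t, phone_ids[s]]
    # Backtrack: which phoneme index each frame is assigned to.
    align, s = [S - 1], S - 1
    for t in range(T - 1, 0, -1):
        s -= back[t, s]
        align.append(s)
    return align[::-1]

T, V = 100, 40                                     # dummy sizes: 100 frames, 40 phoneme classes
log_probs = np.log(np.random.dirichlet(np.ones(V), size=T))
print(forced_align(log_probs, [3, 17, 5, 22]))     # frame-to-phoneme index assignment
```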
We estimate articulatory movements in speech production from different modalities: acoustics and phonemes. Acoustic-to-articulatory inversion (AAI) is a sequence-to-sequence task. On the other hand, phoneme-to-articulatory (PTA) motion estimation faces a key challenge in reliably aligning the text and the articulatory movements. To address this challenge, we explore the use of a transformer architecture, FastSpeech, with explicit duration modelling to learn hard alignments between the phonemes and articulatory movements. We also train a transformer model on AAI. We use the correlation coefficient (CC) and root mean squared error (rMSE) to assess the estimation performance in comparison to existing methods on both tasks. We observe 154%, 11.8%, and 4.8% relative improvements in CC with the subject-dependent, pooled, and fine-tuning strategies, respectively, for PTA estimation. Additionally, on the AAI task, we obtain 1.5%, 3%, and 3.1% relative gains in CC on the same setups compared to the state-of-the-art baseline. We further present the computational benefits of using transformer architectures as representation blocks.
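The two evaluation metrics are standard; a short numpy sketch of how CC and rMSE could be computed per articulatory channel is given below (shapes and channel count are assumptions).

```python
# Sketch: correlation coefficient (CC) and root mean squared error (rMSE) per articulator channel.
import numpy as np

def cc_and_rmse(pred, true):
    """pred, true: (T, n_channels) articulatory trajectories."""
    cc = np.array([np.corrcoef(pred[:, k], true[:, k])[0, 1] for k in range(pred.shape[1])])
    rmse = np.sqrt(np.mean((pred - true) ** 2, axis=0))
    return cc, rmse

pred = np.random.randn(500, 12)                       # e.g. 12 EMA sensor coordinates
true = pred + 0.1 * np.random.randn(500, 12)
cc, rmse = cc_and_rmse(pred, true)
print(cc.mean(), rmse.mean())
```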
It is well known that a mismatch between the training (source) and test (target) data distributions will significantly decrease the performance of acoustic scene classification (ASC) systems. To address this issue, domain adaptation (DA) is one solution, and many unsupervised DA methods have been proposed. These methods focus on the scenario of a single source domain and a single target domain. In practice, however, test data may come from multiple target domains. This problem can be addressed by producing one model per target domain, but this solution is too costly. In this paper, we propose a novel unsupervised multi-target domain adaptation (MTDA) method for ASC, which can adapt to multiple target domains simultaneously and make use of the underlying relations among the domains. Specifically, our approach combines traditional adversarial adaptation with two novel discriminator tasks that learn a common subspace shared by all domains. Furthermore, we propose to divide the target domains into easy-to-adapt and hard-to-adapt domains, which enables the system to pay more attention to the hard-to-adapt domains during training. Experimental results on the DCASE 2020 Task 1-A dataset and the DCASE 2019 Task 1-B dataset show that our proposed method significantly outperforms previous unsupervised DA methods.
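The adversarial-adaptation component typically rests on a domain discriminator trained through a gradient-reversal layer; the generic PyTorch sketch below shows only that standard building block, not the paper's two discriminator tasks or its easy/hard domain split, and all sizes are placeholders.

```python
# Generic adversarial domain-adaptation building block (gradient reversal + domain classifier).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None   # reverse the gradient flowing into the feature extractor

features = nn.Sequential(nn.Linear(64, 128), nn.ReLU())          # shared feature extractor (toy)
label_head = nn.Linear(128, 10)                                   # scene classifier
domain_head = nn.Linear(128, 3)                                   # e.g. source + two target domains

x, y, d = torch.randn(32, 64), torch.randint(0, 10, (32,)), torch.randint(0, 3, (32,))
f = features(x)
loss = nn.functional.cross_entropy(label_head(f), y) \
     + nn.functional.cross_entropy(domain_head(GradReverse.apply(f, 1.0)), d)
loss.backward()
```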
In a hybrid speech model, both voiced and unvoiced components can coexist in a segment. Often, the voiced speech is regarded as the deterministic component, and the unvoiced speech and additive noise are the stochastic components. Typically, the speech signal is considered stationary within fixed segments of 20–40 ms, but the degree of stationarity varies over time. For decomposing noisy speech into its voiced and unvoiced components, a fixed segmentation may be too crude, and here we propose to adapt the segment length according to the local characteristics of the signal. The segmentation relies on parameter estimates of a hybrid speech model and on the maximum a posteriori (MAP) and log-likelihood criteria as rules for model selection among the possible segment lengths, for voiced and unvoiced speech, respectively. Given the optimal segmentation markers and the estimated statistics, both components are estimated using linear filtering. A codebook-based approach differentiates between unvoiced speech and noise. Taking the adaptive segmentation into account yields better extraction of the components than a fixed segmentation, as well as lower distortion for voiced speech and a higher segmental SNR for both components compared to other decomposition methods.
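As a deliberately simplified illustration of likelihood-based segment-length selection (a zero-mean Gaussian likelihood stands in for the paper's hybrid-model MAP and log-likelihood rules), consider the following numpy sketch; candidate lengths and the normalization are assumptions.

```python
# Simplified sketch: choose a local segment length by maximizing a per-sample Gaussian
# log-likelihood (a stand-in for the paper's MAP / log-likelihood model-selection rules).
import numpy as np

def pick_segment_length(x, start, candidates=(160, 320, 640)):    # 10/20/40 ms at 16 kHz
    best_len, best_score = None, -np.inf
    for n in candidates:
        seg = x[start:start + n]
        var = np.var(seg) + 1e-12
        ll = -0.5 * len(seg) * (np.log(2 * np.pi * var) + 1.0)    # Gaussian log-likelihood (ML variance)
        score = ll / len(seg)                                     # normalize to compare different lengths
        if score > best_score:
            best_len, best_score = n, score
    return best_len

x = np.random.randn(16000)                                        # stand-in noisy speech
print(pick_segment_length(x, 4000))
```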
Predicting altered acoustic frames is an effective way of self-supervised learning for speech representation. However, it is challenging to prevent the pretrained model from overfitting. In this paper, we propose to introduce two dropout regularization methods into the pretraining of the transformer encoder: (1) attention dropout and (2) layer dropout. Both dropout methods encourage the model to utilize global speech information and avoid simply copying local spectral features when reconstructing the masked frames. We evaluate the proposed methods on phoneme classification and speaker recognition tasks. The experiments demonstrate that our dropout approaches achieve competitive results and improve classification accuracy on downstream tasks.
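Both regularizers are straightforward to sketch in PyTorch: attention dropout is the dropout applied inside each layer's multi-head attention (in this simplified sketch the same rate is reused for the layer's other dropouts), and layer dropout stochastically skips whole encoder layers during training. Sizes and rates below are assumptions.

```python
# Sketch: attention dropout (inside multi-head attention) and layer dropout (skip whole layers).
import torch
import torch.nn as nn

class EncoderWithLayerDrop(nn.Module):
    def __init__(self, d_model=256, n_layers=6, attn_dropout=0.1, layer_dropout=0.2):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, nhead=4, dropout=attn_dropout, batch_first=True)
            for _ in range(n_layers)
        ])
        self.layer_dropout = layer_dropout

    def forward(self, x):
        for layer in self.layers:
            if self.training and torch.rand(1).item() < self.layer_dropout:
                continue                      # layer dropout: skip this layer for this batch
            x = layer(x)
        return x

enc = EncoderWithLayerDrop()
out = enc(torch.randn(8, 100, 256))           # (batch, frames, features)
```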
We propose a pitch stylization technique that operates in the presence of pitch halving and doubling errors. The technique uses an optimization criterion based on a minimum mean absolute error to make the stylization robust to such pitch estimation errors, particularly under noisy conditions. We obtain the segments for stylization automatically using dynamic programming. Experiments are performed at the frame level and the syllable level. At the frame level, the closeness of the stylized pitch to the ground truth pitch, obtained from a laryngograph signal, is analyzed using the root mean square error (RMSE) measure. At the syllable level, the effectiveness of the perceptually relevant information embedded in the stylized pitch is analyzed by estimating syllabic tones and comparing them with manual tone markings using the Levenshtein distance measure. The proposed approach performs better than a pitch stylization scheme based on a minimum mean squared error criterion at the frame level and a knowledge-based tone estimation scheme at the syllable level, under clean and 20 dB, 10 dB, and 0 dB SNR conditions, with five noises and four pitch estimation techniques. Among all combinations of SNR, noise, and pitch estimation technique, the highest absolute RMSE and mean distance improvements are found to be 6.49 Hz and 0.23, respectively.
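To illustrate the criterion alone: fitting a line to one segment's pitch values under mean absolute error (here via scipy.optimize, which is my choice, not necessarily the paper's solver) is far less sensitive to halving/doubling outliers than a least-squares fit. The dynamic-programming segmentation and tone evaluation are not shown, and all values are synthetic.

```python
# Sketch: per-segment linear pitch stylization under a mean-absolute-error criterion,
# which is less sensitive to halving/doubling outliers than least squares.
import numpy as np
from scipy.optimize import minimize

def stylize_segment_mae(t, f0):
    mae = lambda p: np.mean(np.abs(p[0] + p[1] * t - f0))
    p0 = np.polyfit(t, f0, 1)[::-1]                      # least-squares initialization
    return minimize(mae, p0, method="Nelder-Mead").x     # (intercept, slope)

t = np.linspace(0, 0.3, 30)                              # 30 frames of one segment
f0 = 180 + 50 * t + np.random.randn(30)
f0[::10] *= 0.5                                          # simulated pitch-halving errors
print(stylize_segment_mae(t, f0))
```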
Advancements in speech technology have brought convenience to our lives. However, concerns are on the rise, as the speech signal contains multiple personal attributes that could lead to sensitive information leakage or biased decisions. In this work, we propose an attribute-aligned learning strategy to derive a speech representation that can flexibly address these issues via an attribute-selection mechanism. Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes the speech representation into attribute-sensitive nodes, to derive an identity-free representation for speech emotion recognition (SER) and an emotionless representation for speaker verification (SV). Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV, compared to the current state-of-the-art method based on adversarial learning applied to a large emotion corpus, MSP-Podcast. Moreover, our proposed learning strategy reduces the model complexity and training effort needed to achieve multiple privacy-preserving tasks.
We propose a novel sequence-to-sequence acoustic-to-articulatory inversion (AAI) neural architecture operating in the temporal waveform domain. In contrast to traditional AAI approaches that leverage hand-crafted short-time spectral features obtained from the windowed signal, such as LSFs or MFCCs, our solution directly processes the input speech signal in the time domain, avoiding any intermediate signal transformation, using a cascade of 1D convolutional filters in a deep model. The time-rate synchronization between the raw speech signal and the articulatory signal is obtained through a decimation process applied at each convolution step. Decimation in time thus avoids the degradation phenomena observed in the conventional AAI procedure, caused by the need to frame the speech signal to produce a feature sequence that perfectly matches the articulatory data rate. Experimental evidence on the “Haskins Production Rate Comparison” corpus demonstrates the effectiveness of the proposed solution, which outperforms a conventional state-of-the-art AAI system leveraging MFCCs with a 20% relative improvement in terms of Pearson correlation coefficient (PCC) under mismatched speaking rate conditions. Finally, the proposed approach attains the same accuracy as the conventional AAI solution in the typical matched speaking rate condition.
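A bare-bones PyTorch sketch of the idea follows: a cascade of strided 1D convolutions decimates raw audio down to an assumed articulatory frame rate. Kernel sizes, strides, and channel counts are placeholders, not the paper's architecture.

```python
# Sketch: strided 1D convolutions decimate raw audio toward the articulatory sampling rate.
import torch
import torch.nn as nn

class WaveformAAI(nn.Module):
    def __init__(self, n_articulators=12):
        super().__init__()
        # Total decimation factor = 4 * 4 * 5 * 4 = 320, i.e. 16 kHz audio -> 50 Hz trajectories.
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=16, stride=4, padding=6), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=16, stride=4, padding=6), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=16, stride=5, padding=6), nn.ReLU(),
            nn.Conv1d(128, n_articulators, kernel_size=16, stride=4, padding=6),
        )

    def forward(self, wav):                    # wav: (batch, 1, samples)
        return self.conv(wav)                  # (batch, n_articulators, decimated frames)

wav = torch.randn(2, 1, 16000)                 # 1 s of 16 kHz audio
print(WaveformAAI()(wav).shape)                # -> (2, 12, 50), i.e. 50 articulatory frames
```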
Phonetic analysis often requires reliable estimation of formants, but estimates provided by popular programs can be unreliable. Recently, Dissen et al. [1] described DNN-based formant trackers that produced more accurate frequency estimates than several others, but require manually-corrected formant data for training. Here we describe a novel unsupervised training method for corpus-based DNN formant parameter estimation and tracking with accuracy similar to [1]. Frame-wise spectral envelopes serve as the input. The output is estimates of the frequencies and bandwidths plus amplitude adjustments for a prespecified number of poles and zeros, hereafter referred to as “formant parameters.” A custom loss measure based on the difference between the input envelope and one generated from the estimated formant parameters is calculated and back-propagated through the network to establish the gradients with respect to the formant parameters. The approach is similar to that of autoencoders, in that the model is trained to reproduce its input in order to discover latent features, in this case, the formant parameters. Our results demonstrate that a reliable formant tracker can be constructed for a speech corpus without the need for hand-corrected training data.
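The loss hinges on re-synthesizing a spectral envelope from the estimated formant parameters; a minimal scipy sketch of that forward step (all-pole only, without zeros or amplitude adjustments) is shown below, with every constant an assumption.

```python
# Sketch: rebuild a spectral envelope from formant frequencies/bandwidths as an all-pole filter,
# the kind of forward model a reconstruction loss can compare against the input envelope.
import numpy as np
from scipy.signal import freqz

def envelope_from_formants(freqs_hz, bws_hz, fs=16000, nfft=512):
    a = np.array([1.0])
    for f, bw in zip(freqs_hz, bws_hz):
        r = np.exp(-np.pi * bw / fs)                                # pole radius from bandwidth
        theta = 2 * np.pi * f / fs                                  # pole angle from frequency
        a = np.convolve(a, [1.0, -2 * r * np.cos(theta), r * r])    # conjugate pole pair
    w, h = freqz([1.0], a, worN=nfft, fs=fs)
    return w, 20 * np.log10(np.abs(h) + 1e-12)                      # envelope in dB

w, env_db = envelope_from_formants([500, 1500, 2500], [80, 100, 120])
target_db = env_db + np.random.randn(len(env_db))     # stand-in for the measured input envelope
loss = np.mean((env_db - target_db) ** 2)             # the kind of difference the custom loss penalizes
print(loss)
```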
Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV). The paradigm pretrains a shared model on large volumes of unlabeled data and achieves state-of-the-art (SOTA) performance on various tasks with minimal adaptation. However, the speech processing community lacks a similar setup to systematically explore the paradigm. To bridge this gap, we introduce the Speech processing Universal PERformance Benchmark (SUPERB). SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data. Among the multiple usages of the shared model, we especially focus on extracting the representation learned from SSL for its favorable re-usability. We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model. Our results demonstrate that the framework is promising, as SSL representations show competitive generalizability and accessibility across SUPERB tasks. We release SUPERB as a challenge with a leaderboard and a benchmark toolkit to fuel research in representation learning and general speech processing.
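The framework reduces to freezing the shared upstream model and training a small task head on its features; the generic PyTorch sketch below uses a placeholder upstream encoder and head, not the actual benchmark toolkit API.

```python
# Generic sketch of the SUPERB-style setup: frozen upstream encoder, lightweight trainable head.
# `Upstream` is a placeholder, not the actual benchmark toolkit API.
import torch
import torch.nn as nn

class Upstream(nn.Module):                        # stand-in for a pretrained SSL encoder
    def __init__(self, feat_dim=768):
        super().__init__()
        self.net = nn.Linear(80, feat_dim)

    def forward(self, x):                         # x: (batch, frames, 80), e.g. filterbanks
        return self.net(x)

upstream = Upstream()
for p in upstream.parameters():
    p.requires_grad = False                       # frozen shared model

head = nn.Linear(768, 40)                         # lightweight task head (e.g. phoneme classes)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

x, y = torch.randn(4, 100, 80), torch.randint(0, 40, (4, 100))
with torch.no_grad():
    feats = upstream(x)                           # upstream features, no gradient needed
loss = nn.functional.cross_entropy(head(feats).transpose(1, 2), y)
loss.backward()
opt.step()
```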
Generating synthesised singing voice with models trained on speech data has many advantages due to the models’ flexibility and controllability. However, since information about the temporal relationship between segments and beats is lacking in speech training data, the synthesised singing may sound off-beat at times. The availability of information on the temporal relationship between speech segments and music beats is therefore crucial. The current study investigated segment-beat synchronisation in singing data, with hypotheses formed based on the linguistic theories of the P-centre and the sonority hierarchy. A Mandarin corpus and an English corpus of professional singing data were manually annotated and analysed. The results showed that the presence of musical beats was more dependent on segment duration than on sonority. However, the sonority hierarchy and the P-centre theory were highly related to the location of beats. Mandarin and English demonstrated cross-linguistic variation despite exhibiting common patterns.
Learned speech representations can drastically improve performance on tasks with limited labeled data. However, due to their size and complexity, learned representations have limited utility in mobile settings, where run-time performance can be a significant bottleneck. In this work, we propose a class of lightweight non-semantic speech embedding models that run efficiently on mobile devices, based on the recently proposed TRILL speech embedding. We combine novel architectural modifications with existing speed-up techniques to create embedding models that are fast enough to run in real time on a mobile device and exhibit minimal performance degradation on a benchmark of non-semantic speech tasks. One such model (FRILL) is 32× faster on a Pixel 1 smartphone and 40% the size of TRILL, with an average decrease in accuracy of only 2%. To our knowledge, FRILL is the highest-quality non-semantic embedding designed for use on mobile devices. Furthermore, we demonstrate that these representations are useful for mobile health tasks such as non-speech human sound detection and face-masked speech detection. Our models and code are publicly available.
In everyday conversation, speakers’ utterances often overlap. For conversation corpora recorded in diverse environments, the results of pitch extraction in the overlapping parts may be incorrect. The goal of this study is to establish a technique for separating each speaker’s pitch contour from overlapping speech in conversation. The proposed method estimates the statistically most plausible fo contour from the spectrogram of overlapping speech, along with information about the speaker to extract. Visual inspection of the separation results showed that the proposed model was able to extract accurate fo contours of the specified speakers from overlapping speech. By applying this method, voicing decision errors and gross pitch errors were reduced by 63% compared to simple pitch extraction on overlapping speech.
Transfer learning is critical for efficient information transfer across multiple related learning problems. A simple yet effective transfer learning approach utilizes deep neural networks trained on a large-scale task for feature extraction. Such representations are then used to learn related downstream tasks. In this paper, we investigate the transfer learning capacity of audio representations obtained from neural networks trained on a large-scale sound event detection dataset. We build and evaluate these representations across a wide range of other audio tasks via a simple linear classifier transfer mechanism. We show that such simple linear transfer is already powerful enough to achieve high performance on the downstream tasks. We also provide insights into the attributes of sound event representations that enable such efficient information transfer.
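The transfer mechanism itself is just a linear classifier on frozen embeddings; a minimal scikit-learn sketch with random stand-in data illustrates it.

```python
# Sketch: linear-probe transfer, i.e. train only a linear classifier on frozen audio embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

emb = np.random.randn(1000, 512)                 # stand-in for clip-level embeddings
labels = np.random.randint(0, 10, 1000)          # stand-in downstream labels

X_tr, X_te, y_tr, y_te = train_test_split(emb, labels, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("downstream accuracy:", clf.score(X_te, y_te))
```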
Creaky voice is a nonmodal phonation type that has various linguistic and sociolinguistic functions. Manually annotating creaky voice for phonetic analysis is time-consuming and labor-intensive. In recent years, automatic tools for detecting creaky voice have been proposed, which open up the possibility of easier, faster, and more consistent creak identification. One of these tools is a Creak Detector algorithm that uses a neural network taking several acoustic cues as input to identify creaky voice. Previous work has suggested that the creak probability threshold at which this tool determines an instance to be creaky may vary depending on the speaker population. The present study investigates the optimal creak detection threshold for female Australian English speakers. The results provide further support for the practice of first finding the optimal threshold when applying the Creak Detection algorithm to new data sets. Additionally, the results show that the accuracy of creaky voice detection using the Creak Detection algorithm can be significantly improved by excluding non-sonorant data.
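Finding the optimal creak-probability threshold for a new speaker population can be framed as a simple sweep against manual labels; the scikit-learn sketch below, on dummy data, illustrates that step only, not the Creak Detector itself.

```python
# Sketch: sweep the creak-probability threshold and keep the one maximizing F1 on labelled data.
import numpy as np
from sklearn.metrics import f1_score

probs = np.random.rand(2000)                            # stand-in creak probabilities per frame
labels = (probs + 0.3 * np.random.randn(2000)) > 0.7    # stand-in manual creak annotations

thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(labels, probs > t) for t in thresholds]
best = thresholds[int(np.argmax(scores))]
print("optimal threshold:", best, "F1:", max(scores))
```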
There has been a recent increase in speech research utilizing data recorded with participants’ personal devices, particularly in light of the COVID-19 pandemic and restrictions on face-to-face interactions. This raises important questions about whether these recordings are comparable to those made in traditional lab-based settings. Some previous studies have compared the viability of recordings made with personal devices for the clinical evaluation of voice quality. However, these studies rely on simple statistical analyses and do not examine acoustic correlates of voice quality typically examined in the (socio-) phonetic literature (e.g. H1-H2). In this study, we compare recordings from a set of smartphones/laptops and a solid-state recorder to assess the reliability of a range of acoustic correlates of voice quality. The results show significant differences for many acoustic measures of voice quality across devices. Further exploratory analyses demonstrate that these differences are not simple offsets, but rather that their magnitude depends on the value of the measurement of interest. We therefore urge researchers to exercise caution when examining voice quality based on recordings made with participants’ devices, particularly when interested in small effect sizes. We also call on the speech research community to investigate these issues more thoroughly.
The current study investigates voice quality characteristics of Greek adults with normal hearing and with hearing loss, automatically obtained from glottal inverse filtering analysis using the Aalto Aparat toolkit. Aalto Aparat has been employed in glottal flow analysis of disordered speech but, to the best of the authors’ knowledge, not yet in hearing-impaired voice analysis and assessment. Five speakers with normal hearing (NH), three women and two men, and five speakers with prelingual profound hearing impairment (HI), matched for age and sex, produced symmetrical /ˈpVpV/ disyllables, where V=/i, a, u/. A state-of-the-art method, quasi-closed phase analysis (QCP), offered in Aparat, is used to estimate the glottal source signal. Glottal source features were obtained using time- and frequency-domain parametrization methods and analysed statistically. The interpretation of the results attempts to shed light on potential differences between HI and NH phonation strategies, while advantages and limitations of inverse filtering methods in HI voice assessment are discussed.
The Saarbrücken Voice Database (SVD) contains speech and simultaneous electroglottography recordings of 1002 speakers exhibiting a wide range of voice disorders, together with recordings of 851 controls. Previous studies have used this database to build systems for the automated detection of voice disorders and for differential diagnosis. These studies have varied considerably in the subset of pathologies tested, the audio materials analyzed, the cross-validation method used, and the performance metric reported. This variation has made it hard to determine the most promising approaches to the problem of detecting voice disorders. In this study, we re-implement three recently published systems that were trained to detect pathology using the SVD and compare their performance on the same pathologies with the same audio materials, using a common cross-validation protocol and performance metric. We show that, under this approach, there is much less difference in performance across systems than in their original publications. We also show that voice disorder detection on the basis of a short phrase gives performance similar to that based on a sequence of vowels of different pitch. Our evaluation protocol may be useful for future studies on voice disorder detection with the SVD.
Non-invasive measures of voice quality, such as H1-H2, rely on oral flow signals, inverse filtered speech signals, or corrections for the effects of formants. Voice quality measures play especially important roles in the assessment of voice disorders and the evaluation of treatment efficacy. One type of treatment that is increasingly common in voice therapy, as well as in voice training for singers and actors, is semi-occluded vocal tract exercises (SOVTEs). The goal of SOVTEs is to change patterns of vocal fold vibration and thereby improve voice quality and vocal efficiency. Accelerometers applied to the skin of the neck have been used to investigate subglottal acoustics, to inverse-filter speech signals, and to obtain voice quality metrics. This paper explores the application of neck-skin accelerometers to measure voice quality without oral flow, inverse filtering, or formant correction. Accelerometer-based measures (uncorrected K1-K2 and corrected K1*-K2*, analogous to microphone-based H1-H2 and H1*-H2*) were obtained from typically developing children with healthy voice, before and during SOVTEs. Traditional microphone-based H1-H2 measures (corrected and uncorrected) were also obtained. Results showed that K1-K2 and K1*-K2* were not substantially affected by vocal tract acoustic changes in formant frequencies.
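H1-H2 (and its accelerometer analogue K1-K2) is the level difference between the first two harmonics; a rough numpy sketch on a synthetic signal with a known f0, and without any formant correction, is shown below.

```python
# Rough sketch: uncorrected H1-H2 (or K1-K2) as the dB difference between the first two harmonics.
import numpy as np

def h1_h2(x, fs, f0, nfft=8192):
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)), nfft))
    freqs = np.fft.rfftfreq(nfft, 1.0 / fs)
    h1 = spec[np.argmin(np.abs(freqs - f0))]          # amplitude near f0
    h2 = spec[np.argmin(np.abs(freqs - 2 * f0))]      # amplitude near 2*f0
    return 20 * np.log10(h1 / h2)

fs, f0 = 16000, 220.0
t = np.arange(int(0.05 * fs)) / fs
x = np.sin(2 * np.pi * f0 * t) + 0.4 * np.sin(2 * np.pi * 2 * f0 * t)   # toy two-harmonic signal
print(h1_h2(x, fs, f0))                               # roughly 8 dB for this toy example
```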
Huntington Disease (HD) is a progressive disorder which often manifests in motor impairment. Motor severity (captured via motor score) is a key component in assessing overall HD severity. However, motor score evaluation involves in-clinic visits with a trained medical professional, which are expensive and not always accessible. Speech analysis provides an attractive avenue for tracking HD severity because speech is easy to collect remotely and provides insight into motor changes. HD speech is typically characterized as having irregular articulation. With this in mind, acoustic features that can capture vocal tract movement and articulatory coordination are particularly promising for characterizing motor symptom progression in HD. In this paper, we present an experiment that uses Vocal Tract Coordination (VTC) features extracted from read speech to estimate a motor score. When using an elastic-net regression model, we find that VTC features significantly outperform other acoustic features across varied-length audio segments, which highlights the effectiveness of these features for both short- and long-form reading tasks. Lastly, we analyze the F-value scores of VTC features to visualize which channels are most related to motor score. This work enables future research efforts to consider VTC features for acoustic analyses which target HD motor symptomatology tracking.
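The regression step is standard; a minimal scikit-learn sketch mapping VTC-style feature vectors to a motor score with an elastic net is given below, with the feature matrix and scores as random stand-ins.

```python
# Sketch: elastic-net regression from acoustic coordination features to a clinical motor score.
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_score

X = np.random.randn(120, 300)                             # stand-in VTC feature vectors, one per recording
y = X[:, :5].sum(axis=1) + 0.5 * np.random.randn(120)     # stand-in motor scores

model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5)      # mixes L1 and L2 penalties
print("CV R^2:", cross_val_score(model, X, y, cv=5).mean())
```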
Dysphonia comprises many perceptually deviating aspects of voice, and its overall severity is perceived by the listener according to ways of aggregating the single dimensions that are personally conceived and not well studied. Roughness and breathiness are constituent dimensions in most rating scales devised for clinical use. In this paper, we evaluate several ways to model the mapping of overall severity as a function of the particular ratings of roughness and breathiness. The models include simple linear averaging as well as several non-linear variants suggested elsewhere, with some minor adjustments. The models are evaluated on four datasets from different countries, allowing a more global evaluation of how the mapping is conceived. The results show the limitations of the most widely assumed linear approach, while also hinting at the need for more uniform coverage of the sample space in voice pathology datasets. The models explored in this paper can be expanded to higher-dimensional scales.
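The simplest of the compared mappings, plain averaging versus a fitted linear combination of roughness and breathiness, can be written in a few lines; the nonlinear variants are not reproduced here, and all ratings below are random stand-ins.

```python
# Sketch: two simple mappings from roughness (R) and breathiness (B) ratings to overall severity (G):
# plain averaging versus a fitted linear model. The paper's nonlinear variants are not shown.
import numpy as np
from sklearn.linear_model import LinearRegression

R = np.random.uniform(0, 3, 200)                        # stand-in roughness ratings (0-3 scale assumed)
B = np.random.uniform(0, 3, 200)                        # stand-in breathiness ratings
G = np.maximum(R, B) + 0.2 * np.random.randn(200)       # stand-in overall severity ratings

avg_pred = (R + B) / 2                                  # simple linear averaging
lin = LinearRegression().fit(np.column_stack([R, B]), G)
print("averaging RMSE:", np.sqrt(np.mean((avg_pred - G) ** 2)))
print("fitted linear RMSE:", np.sqrt(np.mean((lin.predict(np.column_stack([R, B])) - G) ** 2)))
```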
Parkinson’s disease (PD) is a central nervous system disorder that causes motor impairment. Recent studies have found that people with PD also often suffer from cognitive impairment (CI). While a large body of work has shown that speech can be used to predict motor symptom severity in people with PD, much less has focused on cognitive symptom severity. Existing work has investigated whether acoustic features derived from speech can be used to detect CI in people with PD. However, these acoustic features are general and are not targeted toward capturing CI. Speech errors and disfluencies provide additional insight into CI. In this study, we focus on read speech, which offers a controlled template from which we can detect errors and disfluencies, and we analyze how errors and disfluencies vary with CI. The novelty of this work is an automated pipeline, including transcription and error and disfluency detection, capable of predicting CI in people with PD. This will enable efficient analyses of how cognition modulates speech in people with PD, leading to scalable speech assessments of CI.