The emergence of deep neural networks has advanced speech enhancement considerably. Most early models focused on estimating the magnitude spectrum while ignoring the phase, which places an upper limit on the achievable evaluation results. Recent studies have proposed deep complex networks, which can handle complex-valued inputs and jointly estimate the magnitude and phase spectra by outputting the real and imaginary parts. The encoder-decoder structure in Deep Complex U-net (DCU) has been proven effective for complex-valued data. To further improve performance, in this paper we design a new network called Funnel Deep Complex U-net (FDCU), which processes magnitude information and phase information separately through a one-encoder-two-decoders structure. Moreover, to achieve a better training effect, we define the negative stretched-SI-SNR as the loss function to avoid errors caused by the negative vector angle. Experimental results show that our FDCU model outperforms state-of-the-art approaches in all evaluation metrics.
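For reference, the sketch below shows a conventional negative SI-SNR loss in PyTorch; the paper's "stretched" variant is not reproduced here, so this is only the baseline definition it builds on.

```python
# Minimal sketch of a negative SI-SNR loss (conventional form, not the paper's stretched variant).
import torch

def neg_si_snr(est, ref, eps=1e-8):
    # Zero-mean both signals along the time axis.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to obtain the target component.
    dot = torch.sum(est * ref, dim=-1, keepdim=True)
    energy = torch.sum(ref ** 2, dim=-1, keepdim=True) + eps
    target = dot / energy * ref
    noise = est - target
    si_snr = 10 * torch.log10(
        torch.sum(target ** 2, dim=-1) / (torch.sum(noise ** 2, dim=-1) + eps) + eps
    )
    return -si_snr.mean()  # negative, so minimizing the loss maximizes SI-SNR
```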
Despite much progress, most temporal convolutional network (TCN) based speech enhancement models focus mainly on modeling the long-term temporal contextual dependencies of speech frames, without taking into account the distribution of the speech signal along the frequency dimension. In this study, we propose a frequency dimension adaptive attention (FAA) mechanism to improve TCNs, which guides the model to selectively emphasize frequency-wise features that carry important speech information and also improves the representation capability of the network. Our extensive experimental investigation demonstrates that the proposed FAA mechanism consistently provides significant improvements in terms of speech quality (PESQ), intelligibility (STOI) and three other composite metrics. More promisingly, it generalizes better to real-world noisy environments.
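The exact FAA design is specific to the paper; the following is only a hypothetical sketch of a frequency-wise attention gate of the general kind described, operating on feature maps of shape (batch, channels, frequency, time).

```python
# Hypothetical frequency-wise attention gate (illustrative only; the FAA details may differ).
import torch
import torch.nn as nn

class FreqAttention(nn.Module):
    def __init__(self, n_freq, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_freq, hidden), nn.ReLU(), nn.Linear(hidden, n_freq)
        )

    def forward(self, x):              # x: (batch, channels, freq, time)
        pooled = x.mean(dim=(1, 3))    # average over channels and time -> (batch, freq)
        weights = torch.sigmoid(self.mlp(pooled))   # per-frequency gate in [0, 1]
        return x * weights[:, None, :, None]        # re-weight the frequency bins
```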
Many early studies reported the importance of vowels and vowel-consonant transitions to speech intelligibility. The present work assessed their perceptual impact on the understanding of time-compressed sentences, which can be used to measure temporal acuity during speech understanding. Mandarin sentences were edited to selectively preserve vowel centers or vowel-consonant transitional segments and to compress the remaining regions, with equivalent time compression rates (TCRs) up to 3, including conditions that preserved only vowel centers or only vowel-consonant transitions. The processed stimuli were presented to normal-hearing listeners for recognition. Results showed that, consistent with the segmental contributions to understanding uncompressed speech, the vowel-only time-compressed stimuli were highly intelligible (i.e., intelligibility score >85%) at a TCR around 3, and vowel-consonant transitions carried important intelligibility information in understanding time-compressed sentences. The time-compression conditions in the present work yielded higher intelligibility scores than their counterparts among PSOLA-processed time-compressed sentences with TCRs around 3. The findings suggest that the design of time-compression processing could be guided towards selectively preserving perceptually important speech segments (e.g., vowels) in the future.
In a recent work [1], a novel Delta Function-based formant shifting approach was proposed for speech intelligibility improvement. The underlying principle is to dynamically relocate the formants, based on where they occur in the spectrum, away from the region of noise. The manner in which the formants are shifted is determined by the parameters of the Delta Function, whose optimal values are found using Comprehensive Learning Particle Swarm Optimization (CLPSO). Although effective, CLPSO is computationally expensive to the extent that it overshadows its merits in intelligibility improvement. As a solution, the current work aims to improve the Short-Time Objective Intelligibility (STOI) of (target) speech using a Delta Function generated using a different (source) language. This transfer learning is based upon the relative positioning of the formant frequencies and pitch values of the source and target language datasets. The proposed approach is demonstrated and validated through experiments with three different languages under variable noisy conditions.
Many subjective experiments have been performed to develop objective speech intelligibility measures, but the novel coronavirus outbreak has made it difficult to conduct experiments in a laboratory. One solution is to perform remote testing using crowdsourcing; however, because the listening conditions cannot be controlled, it is unclear whether the results are entirely reliable. In this study, we compared the speech intelligibility scores obtained from remote and laboratory experiments. The results showed that the mean and standard deviation (SD) of the speech reception threshold (SRT) in the remote experiments were higher than those in the laboratory experiments. However, the variance in the SRTs across the speech-enhancement conditions revealed similarities, implying that remote testing results may be as useful as laboratory experiments for developing an objective measure. We also show that practice-session scores are correlated with SRT values. This information is available a priori, before the main tests are performed, and would be useful for data screening to reduce the variability of the SRT distribution.
Traditional spectral subtraction-type single-channel speech enhancement (SE) algorithms often need to estimate interference components, including noise and/or reverberation, before subtracting them, whereas deep neural network-based SE methods often aim to realize an end-to-end target mapping. In this paper, we show that both denoising and dereverberation can be unified into a common problem by introducing a two-stage paradigm, namely interference component estimation followed by speech recovery. In the first stage, we propose to explicitly extract the magnitude of the interference components, which serves as prior information. In the second stage, with the guidance of this estimated magnitude prior, we can expect to better recover the target speech. In addition, we propose a transform module to facilitate the interaction between the interference components and the desired speech. Meanwhile, a temporal fusion module is designed to model long-term dependencies without ignoring short-term details. We conduct experiments on the WSJ0-SI84 corpus, and the results on both denoising and dereverberation tasks show that our approach outperforms previous advanced systems and achieves state-of-the-art performance in terms of many objective metrics.
Speech enhancement is a task to improve the intelligibility and perceptual quality of degraded speech signals. Recently, neural network-based methods have been applied to speech enhancement. However, many neural network-based methods require users to collect clean speech and background noise for training, which can be time-consuming. In addition, speech enhancement systems trained on particular types of background noise may not generalize well to a wide range of noise. To tackle those problems, we propose a speech enhancement framework trained on weakly labelled data. We first apply a pretrained sound event detection system to detect anchor segments that contain sound events in audio clips. Then, we randomly mix two detected anchor segments as a mixture. We build a conditional source separation network using the mixture and a conditional vector as input. The conditional vector is obtained from the audio tagging predictions on the anchor segments. At inference time, we input a noisy speech signal to the trained system with the one-hot encoding of “Speech” as the condition to predict enhanced speech. Our system achieves a PESQ of 2.28 and an SSNR of 8.75 dB on the VoiceBank-DEMAND dataset, outperforming the previous SEGAN system, which achieved 2.16 and 7.73 dB, respectively.
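The toy sketch below (function and variable names are ours) illustrates the weakly-labelled training idea: two detected anchor segments are mixed, and the audio-tagging prediction of one segment serves as the conditional vector telling the separation network which source to recover.

```python
# Illustrative construction of a weakly-labelled training example (names are hypothetical).
import numpy as np

def make_training_example(anchor_a, anchor_b, tag_probs_a):
    """anchor_a, anchor_b: 1-D waveforms of equal length; tag_probs_a: tagging output for anchor_a."""
    mixture = anchor_a + anchor_b   # mix the two detected anchor segments
    condition = tag_probs_a         # conditional vector from the audio tagging predictions
    target = anchor_a               # the network learns to recover the conditioned source
    return mixture, condition, target

# At inference, the condition becomes a one-hot vector for the "Speech" class,
# so the same network behaves as a speech enhancer on noisy speech.
```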
Speech enhancement (SE) aims to improve speech quality and intelligibility, both of which are related to smooth transitions across speech segments that may carry linguistic information, e.g. phones and syllables. In this study, we propose a novel phone-fortified perceptual loss (PFPL) that takes phonetic information into account for training SE models. To effectively incorporate the phonetic information, the PFPL is computed based on latent representations of the wav2vec model, a powerful self-supervised encoder that renders rich phonetic information. To more accurately measure the distribution distances of the latent representations, the PFPL adopts the Wasserstein distance as the distance measure. Our experimental results first reveal that the PFPL is more correlated with the perceptual evaluation metrics than signal-level losses. Moreover, the results show that the PFPL enables a deep complex U-Net SE model to achieve highly competitive performance in terms of standardized quality and intelligibility evaluations on the Voice Bank–DEMAND dataset.
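As a rough illustration of the distance measure (the wav2vec encoder is treated as a black box here, and the actual PFPL may aggregate the representations differently), the sketch below compares enhanced and clean latent frames with a one-dimensional Wasserstein distance per latent dimension.

```python
# Hedged illustration: average per-dimension 1-D Wasserstein distance between latent frames.
import numpy as np
from scipy.stats import wasserstein_distance

def latent_wasserstein(latents_enhanced, latents_clean):
    """latents_*: (frames, dims) arrays of wav2vec-style latent representations."""
    dims = latents_enhanced.shape[1]
    return np.mean([
        wasserstein_distance(latents_enhanced[:, d], latents_clean[:, d])
        for d in range(dims)
    ])
```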
The discrepancy between the cost function used to train a speech enhancement model and human auditory perception usually makes the quality of the enhanced speech unsatisfactory. Objective evaluation metrics that consider human perception can hence serve as a bridge to reduce this gap. Our previously proposed MetricGAN was designed to optimize objective metrics by connecting the metric with a discriminator. Because only the scores of the target evaluation functions are needed during training, the metrics can even be non-differentiable. In this study, we propose MetricGAN+, which incorporates three training techniques based on domain knowledge of speech processing. With these techniques, experimental results on the VoiceBank-DEMAND dataset show that MetricGAN+ increases the PESQ score by 0.3 compared to the previous MetricGAN and achieves state-of-the-art results (PESQ score = 3.15).
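A hedged sketch of the core MetricGAN objective that MetricGAN+ builds on (the three MetricGAN+ techniques themselves are not shown): a discriminator is trained to regress the normalized PESQ of an enhanced spectrogram, and the generator is trained to drive that learned score towards 1.

```python
# Sketch of the MetricGAN-style losses; D is any network mapping (spectrogram, clean reference) to a score.
import torch
import torch.nn.functional as F

def discriminator_loss(D, clean_mag, enhanced_mag, pesq_normalized):
    # D regresses the (normalized) metric score of the enhanced speech...
    score_enh = D(enhanced_mag, clean_mag)
    # ...and assigns clean speech the maximum normalized score of 1.
    score_clean = D(clean_mag, clean_mag)
    return F.mse_loss(score_enh, pesq_normalized) + \
           F.mse_loss(score_clean, torch.ones_like(score_clean))

def generator_loss(D, clean_mag, enhanced_mag):
    # The generator tries to make the learned metric predictor output 1.
    score = D(enhanced_mag, clean_mag)
    return F.mse_loss(score, torch.ones_like(score))
```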
We propose a monaural intrusive speech intelligibility prediction (SIP) algorithm called STGI based on detecting glimpses in short-time segments in a spectro-temporal modulation decomposition of the input speech signals. Unlike existing glimpse-based SIP methods, the application of STGI is not limited to additive uncorrelated noise; STGI can be employed in a broad range of degradation conditions. Our results show that STGI performs consistently well across 15 datasets covering degradation conditions including modulated noise, noise reduction processing, reverberation, near-end listening enhancement, checkerboard noise, and gated noise.
For speech enhancement, deep complex network based methods have shown promising performance due to their effectiveness in dealing with complex-valued spectra. Recent speech enhancement methods focus on further optimizing network structures and hyperparameters but ignore inherent speech characteristics (e.g., phonetic characteristics), which are important for networks to learn and reconstruct speech information. In this paper, we propose a novel self-supervised learning based phone-fortified (SSPF) method for speech enhancement. Our method explicitly imports phonetic characteristics into a deep complex convolutional network via a Contrastive Predictive Coding (CPC) model pre-trained with self-supervised learning. This greatly improves speech representation learning and speech enhancement performance. Moreover, we also apply the self-attention mechanism to our model for learning long-range dependencies of a speech sequence, which further improves speech enhancement performance. The experimental results demonstrate that our SSPF method outperforms existing methods and achieves state-of-the-art performance in terms of speech quality and intelligibility.
Objective measures of success, such as the perceptual evaluation of speech quality (PESQ), signal-to-distortion ratio (SDR), and short-time objective intelligibility (STOI), have recently been used to optimize deep-learning based speech enhancement algorithms, in an effort to incorporate perceptual constraints into the learning process. Optimizing with these measures, however, may be sub-optimal, since the objective scores do not always strongly correlate with a listener’s evaluation. This motivates the need for approaches that either are optimized with scores that are strongly correlated with human assessments or that use alternative strategies for incorporating perceptual constraints. In this work, we propose an attention-based approach that uses learned speech embedding vectors from a mean-opinion score (MOS) prediction model and a speech enhancement module to jointly enhance noisy speech. Our loss function is jointly optimized with signal approximation and MOS prediction loss terms. We train the model using real-world noisy speech data that has been captured in everyday environments. The results show that our proposed model significantly outperforms other approaches that are optimized with objective measures.
There are many deterministic mathematical operations (e.g. compression, clipping, downsampling) that degrade speech quality considerably. In this paper we introduce a neural network architecture, based on a modification of the DiffWave model, that aims to restore the original speech signal. DiffWave, a recently published diffusion-based vocoder, has shown state-of-the-art synthesized speech quality and relatively short waveform generation times with only a small set of parameters. We replace the mel-spectrum upsampler in DiffWave with a deep CNN upsampler, which is trained to alter the degraded speech mel-spectrum to match that of the original speech. The model is trained using the original speech waveform, but conditioned on the degraded speech mel-spectrum. Post-training, only the degraded mel-spectrum is used as input and the model generates an estimate of the original speech. Our model improves speech quality over the original DiffWave baseline in several different experiments, including speech degraded by LPC-10 compression, AMR-NB compression, and signal clipping. Compared to the original DiffWave architecture, our scheme achieves better performance on several objective perceptual metrics and in subjective comparisons. Improvements over the baseline are further amplified in an out-of-corpus evaluation setting.
Verbal communication in daily use takes the form of continuous speech, which in theory is the ideal data format for assessing oral language ability in educational and clinical domains. However, because phonetic reduction and, in particular, lexical tones in Chinese are greatly affected by discourse context, it is challenging for automatic systems to evaluate continuous speech using acoustic features alone. This study analyzed repetitive and storytelling speech produced by selected Chinese-speaking hearing and hearing-impaired children with distinctively high and low speech intelligibility levels. Word-based reduction types are derived from phonological properties that characterize the contraction degrees of automatically generated surface forms of disyllabic words. F0-based tonal contours are visualized using the centroid-nearest data points in the major clusters computed for tonal syllables. Our results show that the primary speech characteristics of different groups of children can be differentiated by means of reduction type and tone production.
For patients with high-frequency hearing loss but preserved low-frequency hearing, combined electric-acoustic stimulation (EAS) may significantly improve speech perception compared with cochlear implants (CIs) alone. In combined EAS, a hearing aid provides low-frequency information via acoustic (A) stimulation and a CI evokes high-frequency sound sensation via electrical (E) stimulation. The present work investigated the EAS advantage when only a small number (i.e., 1 or 2) of channels were provided for electrical stimulation in a CI, and the effect of carrier bandwidth on understanding Mandarin sentences, in a simulated combined-EAS experiment. The A-portion was extracted via low-pass filtering and the E-portion was generated with a vocoder model preserving multi-channel temporal envelope waveforms, whereas a noise-vocoder and a tone-vocoder were used to simulate the effect of carrier bandwidth. The synthesized stimuli were presented to normal-hearing listeners for recognition. Experimental results showed that while low-pass filtered Mandarin speech was not very intelligible, adding one or two E channels could significantly improve the intelligibility score to above 86.0%. Under the condition with one E channel, using a large carrier bandwidth in noise-vocoder processing provided better intelligibility than using a narrow carrier bandwidth in tone-vocoder processing.
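The sketch below illustrates, with our own filter orders and cutoff choices rather than the paper's exact parameters, how a simulated combined-EAS stimulus can be built from a low-pass A-portion plus a single noise-vocoded E channel.

```python
# Illustrative EAS simulation: low-pass A-portion + one noise-vocoded E channel (parameters are assumptions).
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def simulate_eas(speech, fs, lp_cutoff=500.0, e_band=(500.0, 4000.0)):
    # A-portion: low-pass filtered speech simulating residual acoustic hearing.
    b, a = butter(4, lp_cutoff, btype='low', fs=fs)
    a_portion = filtfilt(b, a, speech)

    # E-portion (single channel): band-pass the speech, take its Hilbert envelope,
    # and re-impose that envelope on a noise carrier filtered to the same band.
    b, a = butter(4, e_band, btype='band', fs=fs)
    band = filtfilt(b, a, speech)
    envelope = np.abs(hilbert(band))
    carrier = filtfilt(b, a, np.random.randn(len(speech)))
    e_portion = envelope * carrier

    return a_portion + e_portion
```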
Correcting deficits in jaw movement has often been ignored in the assessment and treatment of speech disorders. As part of a larger study, a robotic simulation is being developed to help Speech Language Pathologists demonstrate the movement of the jaw, tongue and teeth during the production of speech sounds. Profiling of jaw movement is an important aspect of articulatory simulation. The present study attempts to develop a simple and efficient technique for deriving the jaw parameters and using them to simulate jaw movements through inverse kinematics. Three Kannada-speaking male participants in the age range of 26 to 33 years were instructed to produce selected speech sounds. The image of the final position of the jaw during production of each speech sound was recorded through CT scan and video camera. The angle of the ramus and the angle of the body of the mandible were simulated through inverse kinematics using the RoboAnalyzer software. The variables for inverse kinematics were derived through kinematic analysis, and the Denavit-Hartenberg (D-H) parameters required for kinematic analysis were obtained from still images. The simulated angles were compared with the angles obtained from CT scan images, and no significant difference was observed.
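For concreteness, a minimal sketch of the classic Denavit-Hartenberg homogeneous transform used in such kinematic analysis is shown below; the actual joint values and link parameters for the mandible model are not reproduced here.

```python
# Standard D-H transform for one link; joint and link values here are placeholders.
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Homogeneous transform for one link given classic D-H parameters (theta, d, a, alpha)."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

# Chaining two links (e.g., ramus then body of mandible):
# T = dh_transform(theta1, d1, a1, alpha1) @ dh_transform(theta2, d2, a2, alpha2)
```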
Cued Speech (CS) is a communication system for deaf or hearing-impaired people in which a speaker aids a lipreader at the phonetic level by clarifying potentially ambiguous mouth movements with hand shapes and positions. Feature extraction of multi-modal CS is a key step in CS recognition. Recent supervised deep learning based methods suffer from noisy CS data annotations, especially for the hand shape modality. In this work, we first propose a self-supervised contrastive learning method to learn the feature representation of images without using labels. Secondly, a small amount of manually annotated CS data is used to fine-tune the first module. Thirdly, we present a module that combines Bi-LSTM and self-attention networks to further learn sequential features with temporal and contextual information. In addition, to enlarge the volume and diversity of the current limited CS datasets, we build a new British English dataset containing 5 native CS speakers. Evaluation results on both French and British English datasets show that our model achieves over 90% accuracy in hand shape recognition. Significant improvements of 8.75% (for French) and 10.09% (for British English) are achieved in CS phoneme recognition correctness compared with the state-of-the-art.
The ease of in-the-wild speech recording using smartphones has sparked considerable interest in the combined application of speech, remote measurement technology (RMT) and advanced analytics as a research and healthcare tool. For this to be realised, the acceptability of remote speech collection to the user must be established, in addition to feasibility from an analytical perspective. To understand the acceptance, facilitators, and barriers of smartphone-based speech recording, we invited 384 individuals with major depressive disorder (MDD) from the Remote Assessment of Disease and Relapse — Central Nervous System (RADAR-CNS) research programme in Spain and the UK to complete a survey on their experiences recording their speech. In this analysis, we demonstrate that study participants were more comfortable completing a scripted speech task than a free speech task. For both speech tasks, we found depression severity and country to be significant predictors of comfort. Not seeing smartphone notifications of the scheduled speech tasks, low mood and forgetfulness were the most commonly reported obstacles to providing speech recordings.
Characterizing accurate vs. misarticulated patterns of tongue movement using ultrasound can be challenging in real time because of the fast, independent movement of tongue regions. The usefulness of ultrasound for biofeedback speech therapy is limited because speakers must mentally track and compare differences between their tongue movement and available models. It is desirable to automate this interpretive task using a single parameter representing deviation from known accurate tongue movements. In this study, displacements recorded automatically by ultrasound image tracking were transformed into a single biofeedback parameter (time-dependent difference between blade and dorsum displacements). Receiver operating characteristic (ROC) curve analysis was used to evaluate this parameter as a predictor of production accuracy over a range of different vowel contexts with initial and final /r/ in American English. Areas under ROC curves were 0.8 or above, indicating that this simple parameter may provide useful real-time biofeedback on /r/ accuracy within a range of rhotic contexts.
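A minimal sketch of the described ROC evaluation, assuming the time-dependent blade-dorsum difference has already been reduced to one value per token (variable and function names are ours):

```python
# ROC analysis of a single biofeedback parameter as a predictor of /r/ accuracy.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def evaluate_parameter(blade_disp, dorsum_disp, is_accurate):
    """blade_disp, dorsum_disp: per-token displacement values; is_accurate: 0/1 production labels."""
    parameter = blade_disp - dorsum_disp            # single difference parameter per token
    auc = roc_auc_score(is_accurate, parameter)     # area under the ROC curve
    fpr, tpr, thresholds = roc_curve(is_accurate, parameter)
    return auc, fpr, tpr, thresholds
```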
We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips. We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech. We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode mismatch between training and testing. We improve silent speech recognition performance using techniques that address the domain mismatch, such as fMLLR and unsupervised model adaptation. We also analyse the properties of silent and modal speech in terms of utterance duration and the size of the articulatory space. To estimate the articulatory space, we compute the convex hull of tongue splines, extracted from ultrasound tongue images. Overall, we observe that the duration of silent speech is longer than that of modal speech, and that silent speech covers a smaller articulatory space than modal speech. Although these two properties are statistically significant across speaking modes, they do not directly correlate with word error rates from speech recognition.
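One straightforward way to compute such an articulatory-space estimate (a sketch under the assumption that tongue spline points are available as 2-D coordinates pooled across frames) is the area of their convex hull:

```python
# Articulatory space estimated as the area of the convex hull of pooled tongue spline points.
import numpy as np
from scipy.spatial import ConvexHull

def articulatory_space_area(spline_points):
    """spline_points: (N, 2) array of tongue spline coordinates pooled over the frames of an utterance."""
    hull = ConvexHull(spline_points)
    return hull.volume   # for 2-D input, ConvexHull.volume is the enclosed area
```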
Speech is our most natural and efficient way of communication and offers a strong potential to improve how we interact with machines. However, speech communication can sometimes be limited by environmental (e.g., ambient noise), contextual (e.g., need for privacy in a public place), or health conditions (e.g., laryngectomy), hindering the use of audible speech. In this regard, silent speech interfaces (SSI) have been proposed (e.g., based on video or electromyography); however, many technologies still face limitations regarding their everyday use, e.g., the need to place equipment in contact with the speaker (e.g., electrodes or an ultrasound probe), and raise technical (e.g., lighting conditions for video) or privacy concerns. In this context, technologies that can help tackle these issues, e.g., by being contactless and/or placed in the environment, can foster the widespread use of SSI. In this article, continuous-wave radar is explored to assess its potential for SSI. To this end, a corpus of 13 words was acquired from 3 speakers, and different classifiers were tested on the resulting data. The best results, obtained with a Bagging classifier trained for each speaker with 5-fold cross-validation, yielded an average accuracy of 0.826, an encouraging result that establishes promising grounds for further exploration of this technology for silent speech recognition.
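A sketch of the reported evaluation protocol, assuming radar-derived feature vectors and word labels are already available (names and classifier defaults are our own choices):

```python
# Per-speaker word classification with a Bagging classifier and 5-fold cross-validation.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

def speaker_accuracy(features, word_labels, seed=0):
    """features: (n_utterances, n_features) radar features for one speaker; word_labels: word IDs."""
    clf = BaggingClassifier(random_state=seed)
    scores = cross_val_score(clf, features, word_labels, cv=5)
    return scores.mean()   # average accuracy across the 5 folds
```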
Silent speech interfaces (SSIs) are devices that convert non-audio bio-signals to speech and hold the potential of recovering quality speech for laryngectomees (people who have undergone laryngectomy). Although significant progress has been made, most recent SSI works have focused on data collected from healthy speakers, and SSIs for laryngectomees have rarely been investigated. In this study, we investigated the reconstruction of speech for two laryngectomees who use either tracheoesophageal puncture (TEP) or electro-larynx (EL) speech as their post-surgery communication mode. We reconstructed their speech using two SSI designs: (1) real-time recognition-and-synthesis and (2) direct articulation-to-speech synthesis (ATS). The reconstructed speech samples were evaluated subjectively by 20 listeners in terms of naturalness and intelligibility. The results indicated that both designs increased the naturalness of alaryngeal speech. The real-time recognition-and-synthesis design also improved the intelligibility of electrolarynx speech, while ATS did not. These preliminary results suggest that the real-time recognition-and-synthesis design may have better potential for clinical applications (for laryngectomees) than ATS.
Fundamental frequency (f0) estimation, also known as pitch tracking, has been a long-standing research topic in the speech and signal processing community. Many pitch estimation algorithms, however, fail in noisy conditions or introduce large delays due to their frame size or Viterbi decoding. In this study, we propose a deep learning-based pitch estimation algorithm, LACOPE, trained in a joint pitch estimation and speech enhancement framework. In contrast to previous work, this algorithm allows for a configurable latency down to an algorithmic delay of 0. This is achieved by exploiting the smoothness of the pitch trajectory: a recurrent neural network compensates for the delay introduced by the feature computation by predicting the pitch at a desired time point, allowing a trade-off between pitch accuracy and latency. We integrate the pitch estimation into a speech enhancement framework for hearing aids. For this application, we allow a delay on the analysis side of approximately 5 ms. The pitch estimate is then used to construct a comb filter in the frequency domain as a post-processing step to remove intra-harmonic noise. Our pitch estimation performance is on par with state-of-the-art algorithms such as PYIN and CREPE for spoken speech in all noise conditions, while introducing minimal latency.
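As a hypothetical illustration of the comb-filter post-processing step (bandwidth and attenuation floor are our own choices, not the paper's), per-bin gains can be built to keep STFT bins near harmonics of the estimated f0 and attenuate the bins in between:

```python
# Frequency-domain comb-filter gains built from an estimated pitch (illustrative parameters).
import numpy as np

def comb_filter_gains(f0, n_fft, fs, bandwidth_hz=50.0, floor=0.1):
    """Per-bin gains for one frame given the estimated pitch f0 (Hz)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    # Distance of each bin to the nearest harmonic of f0.
    harmonic_dist = np.abs(freqs - f0 * np.round(freqs / f0))
    # Keep bins close to a harmonic, attenuate the intra-harmonic regions.
    return np.where(harmonic_dist <= bandwidth_hz / 2, 1.0, floor)

# Example use on one STFT frame:
# spectrum = np.fft.rfft(frame, n_fft)
# enhanced = np.fft.irfft(spectrum * comb_filter_gains(120.0, n_fft, 16000), n_fft)
```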
This paper proposes α-stable autoregressive fast multichannel nonnegative matrix factorization (α-AR-FastMNMF), a robust joint blind speech enhancement and dereverberation method for improved automatic speech recognition in realistic adverse environments. FastMNMF, a state-of-the-art versatile blind source separation method that assumes the short-time Fourier transform (STFT) coefficients of the direct sound follow a circular complex Gaussian distribution with jointly-diagonalizable full-rank spatial covariance matrices, is extended to AR-FastMNMF with an autoregressive reverberation model. Instead of the light-tailed Gaussian distribution, we use the heavy-tailed α-stable distribution, which also has the reproductive property useful for additive source modeling, to better deal with the large dynamic range of the direct sound. The experimental results demonstrate that the proposed α-AR-FastMNMF works well as a front-end of an automatic speech recognition system. It outperforms α-AR-ILRMA, which is a special case of α-AR-FastMNMF, and their Gaussian counterparts, i.e., AR-FastMNMF and AR-ILRMA, in terms of speech signal quality metrics and word error rate.