This paper presents a fast and effective algorithm for enhancing speech intelligibility in additive noise conditions under the constraint of equal signal power before and after enhancement. Speech energy is reallocated in time, using dynamic range compression, and in frequency, by boosting the signal-to-noise ratio in high frequencies and increasing the contrast between consecutive spectral peaks and valleys in the mid-frequencies, while maintaining the spectral energy in low frequencies. The algorithm has a 90% lower computational load than similar, recently proposed state-of-the-art approaches, while large formal speech-in-noise intelligibility tests show that it performs equally well to these methods in terms of intelligibility gains.
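A minimal sketch of the frequency-domain side of such a scheme: a high-frequency boost applied to the spectrum, followed by renormalization so the output power equals the input power. The crossover frequency, boost amount, and filter shape are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def hf_boost_equal_power(x, fs, f_cut=2000.0, boost_db=9.0):
    """Toy high-frequency emphasis under an equal-power constraint.

    Frequencies above f_cut are boosted by boost_db, then the output is
    rescaled so its total power equals that of the input.
    (Illustrative parameters, not those of the paper.)
    """
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    gain = np.where(freqs >= f_cut, 10.0 ** (boost_db / 20.0), 1.0)
    y = np.fft.irfft(X * gain, n=len(x))
    # Equal-power constraint: rescale so sum(y**2) == sum(x**2).
    y *= np.sqrt(np.sum(x ** 2) / (np.sum(y ** 2) + 1e-12))
    return y

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 4000 * t)
    y = hf_boost_equal_power(x, fs)
    print(np.sum(x ** 2), np.sum(y ** 2))  # powers match
```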
Clear speech has been shown to have an intelligibility advantage over casual speech in noisy and reverberant environments. This work investigates spectral and time-domain modifications to increase the intelligibility of casual speech in reverberant environments by compensating for particular differences between the two speaking styles. To compensate for spectral differences, a frequency-domain filtering approach is applied to casual speech. In the time domain, two techniques for time-scaling casual speech are explored: (1) uniform time-scaling and (2) pause insertion and phoneme elongation based on loudness and modulation criteria. The effect of the proposed modifications is evaluated through subjective listening tests in two reverberant conditions with reverberation times of 0.8 s and 2 s. The combination of spectral transformation and uniform time-scaling is shown to be the most successful in increasing the intelligibility of casual speech. The evaluation results support the conclusion that modifications inspired by clear speech can be beneficial for the intelligibility enhancement of speech in reverberant environments.
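A minimal sketch of the uniform time-scaling step, using librosa's phase-vocoder-based stretcher. The file name and stretch factor are illustrative assumptions, not the values used in the paper.

```python
import librosa
import soundfile as sf

# Uniform time-scaling of a casual-speech recording (hypothetical file).
# rate < 1.0 slows the speech down, lengthening it as clear speech tends to be.
y, sr = librosa.load("casual_utterance.wav", sr=None)
y_slow = librosa.effects.time_stretch(y, rate=0.8)     # roughly 25% longer
sf.write("casual_utterance_slow.wav", y_slow, sr)
```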
Listeners suffering from presbycusis (age-related hearing loss) often report difficulties when attempting to understand vocal announcements in public spaces. Current solutions that improve speech intelligibility for hearing-impaired subjects mainly consist of customized devices, such as hearing aids. This study proposes a more generic strategy that would enhance speech perception for both normal-hearing and hearing-impaired listeners, i.e., "For All", and presents the early stages of such an approach. Digital filters with different degrees of hearing-loss compensation were designed, inspired by the way hearing aids process speech signals. Subjective tests conducted on normal-hearing subjects and subjects with presbycusis confirmed that it is possible to improve speech intelligibility for both populations simultaneously.
Speech intelligibility is an important factor for successful speech communication in today's society. So-called near-end listening enhancement (NELE) algorithms aim at improving speech intelligibility in conditions where the (clean) speech signal is accessible and can be modified prior to its presentation. However, many of these algorithms only consider the detrimental effect of noise and disregard the effect of reverberation. Therefore, in this paper we propose to additionally incorporate the detrimental effects of reverberation into noise-adaptive near-end listening enhancement algorithms. Based on the Speech Transmission Index (STI), which is widely used for speech intelligibility prediction, the effect of reverberation is effectively accounted for as an additional noise power term. This combined noise power term is used in a state-of-the-art noise-adaptive NELE algorithm. Simulations using two objective measures, the STI and the short-time objective intelligibility (STOI) measure, demonstrate the potential of the proposed approach to improve the predicted speech intelligibility in noisy and reverberant conditions.
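A minimal sketch of the core idea of folding reverberation into the noise term: the late part of a room impulse response is treated as an additional noise contribution whose power is added to the measured noise power before a NELE gain rule is applied. The 50 ms early/late split and the simple power estimate are rough assumptions for illustration, not the STI-based derivation used in the paper.

```python
import numpy as np

def combined_noise_power(noise, rir, speech_power, fs, early_ms=50.0):
    """Combine measured noise power with a late-reverberation power term.

    The RIR tail after `early_ms` is treated as if it injected additional
    noise whose power scales with the speech power (a crude stand-in for
    treating reverberation as noise).
    """
    split = int(early_ms * 1e-3 * fs)
    late_energy = np.sum(rir[split:] ** 2)      # energy of the late tail
    early_energy = np.sum(rir[:split] ** 2)     # energy of the early part
    reverb_power = speech_power * late_energy / (early_energy + 1e-12)
    noise_power = np.mean(noise ** 2)
    return noise_power + reverb_power
```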
The 'Lombard effect' consists of various speech adaptation mechanisms that human speakers use involuntarily to counter the influence a noisy environment has on their speech intelligibility. These adaptations are highly dependent on the characteristics of the noise and happen rapidly. Modelling the effect for the output side of speech interfaces is therefore difficult: the noise characteristics need to be evaluated continuously, and speech synthesis adaptations need to take effect immediately. This paper describes and evaluates an online system consisting of a module that analyses the acoustic environment and a module that adapts the speech parameters of an incremental speech synthesis system in a timely manner. In an evaluation with human listeners, the system had a similar effect on intelligibility as human speakers had in offline studies. Furthermore, during noise the Lombard-adapted speech was rated as more natural than standard speech.
Intelligibility enhancement can be applied in mobile communications as a post-processing step when the background noise conditions are adverse. In this study, post-processing methods aiming to model the Lombard effect are investigated. More specifically, the study focuses on mapping the spectral tilt of normal speech to that of Lombard speech to improve intelligibility of telephone speech in near-end noise conditions. Two different modelling techniques, Gaussian mixture models (GMMs) and Gaussian processes (GPs), are evaluated with different amounts of training data. Normal-to-Lombard conversions implemented by GMMs and GPs are then compared objectively as well as in subjective intelligibility and quality tests with unprocessed speech in different noise conditions. All GMMs and GPs evaluated in the subjective tests were able to improve intelligibility without significant decrease in quality compared to unprocessed speech. While the best intelligibility results were obtained with a GP model, other GMM and GP alternatives were rated higher in quality. Based on the results, determining the best modelling technique for normal-to-Lombard mapping is challenging and calls for further studies.
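A minimal sketch of a GMM-based normal-to-Lombard mapping of the kind evaluated here: a joint GMM is fitted over paired (normal, Lombard) spectral-tilt features, and the Lombard feature is predicted as the conditional expectation given a normal-speech feature. The feature dimensionality, component count, and synthetic training pairs are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(x, y, n_components=4):
    """Fit a joint GMM over stacked source (x) and target (y) features."""
    z = np.hstack([x, y])                               # shape (N, dx + dy)
    return GaussianMixture(n_components=n_components,
                           covariance_type="full", random_state=0).fit(z)

def gmm_regression(gmm, x, dx):
    """Conditional expectation E[y | x] under a joint full-covariance GMM."""
    dy = gmm.means_.shape[1] - dx
    resp = np.zeros(gmm.n_components)
    cond = np.zeros((gmm.n_components, dy))
    for k in range(gmm.n_components):
        mu, S = gmm.means_[k], gmm.covariances_[k]
        mu_x, mu_y = mu[:dx], mu[dx:]
        S_xx, S_yx = S[:dx, :dx], S[dx:, :dx]
        resp[k] = gmm.weights_[k] * multivariate_normal.pdf(x, mu_x, S_xx)
        cond[k] = mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x)
    return (resp / resp.sum()) @ cond

# Hypothetical paired spectral-tilt features (e.g., low-order cepstra):
# normal-speech features and their "Lombard" counterparts.
x_train = np.random.randn(500, 3)
y_train = 0.5 * x_train + 0.2 + 0.1 * np.random.randn(500, 3)
gmm = fit_joint_gmm(x_train, y_train)
print(gmm_regression(gmm, x_train[0], dx=3))            # predicted Lombard features
```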
In this paper, we address the problem of source localization and separation using sparse methods over a spherical microphone array. A sparsity-based method is developed from the observed data in the spherical harmonic domain. A solution to the sparse model formulated herein is obtained by imposing an orthonormality constraint on the sparsity matrix. Subsequently, a splitting method based on Bregman iteration is used to jointly localize and separate the sources from the mixtures. A joint estimate of the locations and the separated sources is obtained after a fixed number of iterations. Experiments on source localization and separation are conducted at different SNRs on the GRID database. Experimental results based on RMSE analysis and objective evaluation indicate a reasonable performance improvement compared to other methods in the literature.
Non-negative matrix factorization (NMF) aims at finding non-negative representations of non-negative data. Among NMF algorithms, the alternating direction method of multipliers (ADMM) is a popular one with superior performance. However, we find that ADMM shows instability and inferior performance on real-world data such as speech signals. In this paper, to solve this problem, we develop a class of advanced regularized ADMM algorithms for NMF. Efficient and robust learning rules are achieved by incorporating l1-norm and Frobenius-norm regularization. Prior information in the form of a Laplacian distribution of the data is used so that the problem has a unique solution. We evaluate this class of ADMM algorithms using both synthetic and real speech signals for a source separation task with different cost functions, i.e., the Euclidean distance (EUD), the Kullback-Leibler (KL) divergence and the Itakura-Saito (IS) divergence. Results demonstrate that the proposed algorithms converge faster and yield more stable and accurate results than the original ADMM algorithm.
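As a point of reference for regularized NMF (a multiplicative-update baseline, not the paper's regularized ADMM solver), a minimal sketch using scikit-learn's NMF with a mixed l1/Frobenius penalty on the factors; the random stand-in data and the regularization strengths are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

# Non-negative data, e.g. a magnitude spectrogram (random stand-in here).
V = np.abs(np.random.rand(257, 200))

# Multiplicative-update NMF with a mixed l1 / Frobenius penalty on both factors.
# Regularization strengths are illustrative only.
model = NMF(n_components=20, solver="mu", beta_loss="kullback-leibler",
            init="random", random_state=0,
            alpha_W=0.01, alpha_H=0.01, l1_ratio=0.5, max_iter=400)
W = model.fit_transform(V)       # basis matrix
H = model.components_            # activation matrix
print(W.shape, H.shape, model.reconstruction_err_)
```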
Recently, supervised speech separation has been extensively studied and has shown considerable promise. Due to the temporal continuity of speech, auditory features and separation targets exhibit prominent spectro-temporal structure and strong correlations over the time-frequency (T-F) domain, which can be exploited for speech separation. However, many supervised speech separation methods model each T-F unit independently with only one target and largely ignore this useful information. In this paper, we propose a two-stage multi-target joint learning method to jointly model the related speech separation targets at the frame level. Systematic experiments show that the proposed approach consistently achieves better separation and generalization performance in low signal-to-noise ratio (SNR) conditions.
We propose a multi-objective framework to learn both secondary targets not directly related to the intended task of speech enhancement (SE) and the primary target of the clean log-power spectra (LPS) features to be used directly for constructing the enhanced speech signals. In deep neural network (DNN) based SE, we introduce an auxiliary structure to learn secondary continuous features, such as mel-frequency cepstral coefficients (MFCCs), and categorical information, such as the ideal binary mask (IBM), and integrate it into the original DNN architecture for joint optimization of all the parameters. This joint estimation scheme imposes additional constraints not available in the direct prediction of LPS, and potentially improves the learning of the primary target. Furthermore, the learned secondary information as a byproduct can be used for other purposes, e.g., the IBM-based post-processing in this work. A series of experiments shows that joint LPS and MFCC learning improves the SE performance, and IBM-based post-processing further enhances listening quality of the reconstructed speech.
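A minimal sketch of the multi-target idea in PyTorch: a shared feed-forward trunk with a primary LPS regression head plus auxiliary MFCC regression and IBM classification heads, trained with a weighted sum of losses. Layer sizes, feature dimensions, and loss weights are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MultiTargetSE(nn.Module):
    """Shared trunk with a primary (LPS) head and auxiliary (MFCC, IBM) heads."""

    def __init__(self, n_in=257, n_lps=257, n_mfcc=13, n_ibm=257, n_hidden=1024):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        self.head_lps = nn.Linear(n_hidden, n_lps)    # primary target
        self.head_mfcc = nn.Linear(n_hidden, n_mfcc)  # auxiliary continuous target
        self.head_ibm = nn.Linear(n_hidden, n_ibm)    # auxiliary mask logits

    def forward(self, x):
        h = self.trunk(x)
        return self.head_lps(h), self.head_mfcc(h), torch.sigmoid(self.head_ibm(h))

model = MultiTargetSE()
mse, bce = nn.MSELoss(), nn.BCELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# One illustrative training step on random stand-in features and targets.
noisy_lps = torch.randn(32, 257)
clean_lps, clean_mfcc = torch.randn(32, 257), torch.randn(32, 13)
ibm = (torch.rand(32, 257) > 0.5).float()

lps_hat, mfcc_hat, ibm_hat = model(noisy_lps)
loss = (mse(lps_hat, clean_lps)
        + 0.1 * mse(mfcc_hat, clean_mfcc)
        + 0.1 * bce(ibm_hat, ibm))
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```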
Non-negative matrix factorization (NMF) is a dimensionality reduction method that usually leads to a part-based representation and has been shown to be effective for source separation. However, separation performance degrades when one signal can be described with the bases of the other source signals. In this paper, we propose a discriminative NMF (DNMF) algorithm which exploits the reconstruction error for the other signals as well as for the target signal based on the target bases. The objective function used to train the basis matrix rewards high reconstruction error for the other source signals in addition to low reconstruction error for the signal from the corresponding source. Experiments showed that the proposed method outperformed standard NMF by about 0.26 in perceptual evaluation of speech quality score and 1.95 dB in signal-to-distortion ratio when applied to speech enhancement at an input SNR of 0 dB.
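A minimal sketch of a discriminative objective of this flavor: the target basis is scored by how well it reconstructs the target spectrogram and how poorly it reconstructs the interfering source. This only evaluates the objective for given factors; the trade-off weight and the standard multiplicative updates used to fit the activations are illustrative assumptions, not the paper's training rule.

```python
import numpy as np

def reconstruction_error(V, W, n_iter=100):
    """Frobenius reconstruction error of V using a fixed non-negative basis W.

    Activations H are estimated with standard multiplicative updates.
    """
    H = np.abs(np.random.rand(W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ (W @ H) + 1e-12)
    return np.linalg.norm(V - W @ H, "fro") ** 2

def discriminative_objective(W_target, V_target, V_other, weight=0.1):
    """Reward low error on the target source and high error on the other source."""
    return (reconstruction_error(V_target, W_target)
            - weight * reconstruction_error(V_other, W_target))

# Illustrative use with random stand-in spectrograms and basis.
V_tgt, V_oth = np.random.rand(64, 50), np.random.rand(64, 50)
W_tgt = np.random.rand(64, 10)
print(discriminative_objective(W_tgt, V_tgt, V_oth))
```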
This work proposes a method to exploit both audio and visual speech information to extract a target speaker from a mixture of competing speakers. The work begins by taking an effective audio-only method of speaker separation, namely the soft mask method, and modifying its operation to allow visual speech information to improve the separation process. The audio input is taken from a single channel and includes the mixture of speakers, and a separate set of visual features is extracted from each speaker. This allows modification of the separation process to include not only the audio speech but also visual speech from each speaker in the mixture. Experimental results are presented that compare the proposed audio-visual speaker separation with audio-only and visual-only methods using both speech quality and speech intelligibility metrics.
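A minimal sketch of the audio-only soft mask step that the work starts from: per-time-frequency-bin magnitude estimates for the target and the competing speaker are turned into a ratio mask and applied to the mixture STFT. The magnitude estimates are taken as given here (in the paper they come first from audio-only models and then from audio-visual ones), and the STFT parameters are illustrative assumptions.

```python
import numpy as np
import librosa

def soft_mask_separate(mixture, est_mag_target, est_mag_other,
                       n_fft=512, hop_length=128):
    """Apply a ratio (soft) mask built from per-source magnitude estimates.

    est_mag_target / est_mag_other must match the STFT shape of `mixture`.
    """
    X = librosa.stft(mixture, n_fft=n_fft, hop_length=hop_length)
    mask = est_mag_target / (est_mag_target + est_mag_other + 1e-12)
    return librosa.istft(mask * X, hop_length=hop_length, length=len(mixture))
```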
Recently, Google launched YouTube Kids, a mobile application for children that uses a speech recognizer built specifically for recognizing children's speech. In this paper we present techniques we explored to build such a system. We describe the use of a neural network classifier to identify matched acoustic training data, and the filtering of language modeling data to reduce the chance of producing offensive results. We also compare long short-term memory (LSTM) recurrent networks to convolutional, LSTM, deep neural networks (CLDNN). We found that a CLDNN acoustic model outperforms an LSTM across a variety of different conditions, but does not model child speech relatively better than adult speech. Overall, these findings allow us to build a successful, state-of-the-art large vocabulary speech recognizer for both children and adults.
Atypical speech prosody is a primary characteristic of autism spectrum disorders (ASD), yet it is often excluded from diagnostic instrument algorithms due to poor subjective reliability. Robust, objective prosodic cues can enhance our understanding of those aspects which are atypical in autism. In this work, we connect objective signal-derived descriptors of prosody to subjective perceptions of prosodic awkwardness. Subjectively, more awkward speech is less expressive (more monotone) and more often has perceived awkward rate/rhythm, volume, and intonation. We also find expressivity can be quantified through objective intonation variability features, and that speaking rate and rhythm cues are highly predictive of perceived awkwardness. Acoustic-prosodic features are also able to significantly differentiate subjects with ASD from typically developing (TD) subjects in a classification task, emphasizing the potential of automated methods for diagnostic efficiency and clarity.
Automatic speech recognition (ASR) for children's speech is more difficult than for adults' speech. A plausible explanation is that ASR errors are due to predictable phonological effects associated with language acquisition. We describe phone recognition experiments on hand-labelled data for children aged between 5 and 9. A comparison of the resulting confusion matrices with those for adult speech (TIMIT) shows increased phone substitution rates for children, which correspond to some extent to established phonological phenomena. However, these errors still only account for a relatively small proportion of the recognition errors. This suggests that attempts to improve ASR accuracy on children's speech by accommodating these phenomena, for example by changing the pronunciation dictionary, cannot solve the whole problem.
In this paper we evaluate how speaker familiarity influences the engagement times and performance of blind school children when playing audio games made with different synthetic voices. We developed synthetic voices of school children, their teachers and of speakers that were unfamiliar to them and used each of these voices to create variants of two audio games: a memory game and a labyrinth game. Results show that pupils had significantly longer engagement times and better performance when playing games that used synthetic voices built with their own voices. This result was observed even though the children reported not recognising the synthetic voice as their own after the experiment was over. These findings could be used to improve the design of audio games and lecture books for blind and visually impaired children.
This work focuses on the issues and challenges of acoustic adaptation in the context of on-line children's speech recognition. When children's speech is decoded using acoustic models trained on adults' speech, severely degraded recognition performance is observed on account of the extreme acoustic mismatch. Although a number of conventional adaptation techniques are available, they introduce undesirable latency for an on-line task. To address this, in this work we combine two low-complexity fast adaptation techniques, namely acoustic model interpolation and low-rank feature projection, and present two schemes for doing so. In the first approach, model interpolation is performed using weights estimated in an unconstrained fashion. The second is a hybrid approach in which a set of mean supervectors is pre-estimated using suitable development data and then optimally scaled using the given test data. Although the unconstrained approach yields larger improvements over the baseline, it has higher complexity and memory requirements. For the hybrid approach, when interpolating M models, the number of parameters to be estimated and the memory requirements are reduced by a factor of (M - 1).
In this paper, large vocabulary children's speech recognition is investigated using the Deep Neural Network - Hidden Markov Model (DNN-HMM) hybrid and the Subspace Gaussian Mixture Model (SGMM) acoustic modeling approaches. In the investigated scenario, training data is limited to about 7 hours of speech from children in the age range 7-13, and testing data consists of clean read speech from children in the same age range. To tackle inter-speaker acoustic variability, speaker adaptive training based on feature-space maximum likelihood linear regression, as well as vocal tract length normalization, is adopted. Experimental results show that very good recognition results can be achieved with both DNN-HMM and SGMM systems, although the best results are obtained with the DNN-HMM system.
Hidden Markov Model (HMM)-based synthesis in combination with speaker adaptation has proven to be an approach that is well-suited for child speech synthesis [1]. This paper describes the development and evaluation of different HMM-based child speech synthesis systems. The aim is to determine the most suitable combination of initial model and speaker adaptation techniques to synthesize child speech. The results of the study indicate that gender-independent initial models perform better than gender-dependent initial models and Constrained Structural Maximum a Posteriori Linear Regression (CSMAPLR) followed by maximum a posteriori (MAP) is the speaker adaptation technique combination that yields the most natural and intelligible synthesized child speech.
Since children (5-9 years old) are still developing their emotional and social skills, their social interaction behaviors in small groups might differ from adults'. In order to develop a robot that is able to support children performing collaborative tasks in small groups, it is necessary to gain a better understanding of how children interact with each other. We were interested in investigating vocal turn-taking patterns, as we expect these to reveal relations to collaborative and conflict behaviors, especially for children, as previous literature suggests. To that end, we collected an audiovisual corpus of children performing collaborative tasks together in groups of three. Automatic turn-taking analyses showed that speaker changes with overlaps are more common than those without, and that children showed smoother turn-taking patterns, i.e., less frequent and longer-lasting speaker changes, during collaborative than during conflict behaviors.
Speech delay is a childhood language problem that sometimes resolves on its own but sometimes leads to more serious language difficulties later. This leads therapists to screen children at early ages in order to detect and address such problems before they worsen. Using the Goldman-Fristoe Test of Articulation (GFTA) method, therapists listen to a child's pronunciation of certain phonemes and phoneme pairs in specified words and judge the child's stage of speech development. The goal of this paper is to develop an Automatic Speech Recognition (ASR) tool and related speech processing methods which emulate the knowledge of speech therapists. In this paper, two methods of feature extraction (MFCC and DCTC) were used as the baseline for training an HMM-based utterance verification system, which was then used to test the utterances of 63 young children (ages 4-10), both typically developing and speech delayed. The ASR results show the value of augmenting static spectral information with spectral trajectory information for better prediction of therapists' judgments.
The automatic evaluation of children's reading performance by detecting and analyzing errors and disfluencies in speech is an important tool for building automatic reading tutors and for complementing the current method of manual evaluation of overall reading ability in schools. A large amount of speech from children reading aloud, rich in errors and disfluencies, is needed to train acoustic, disfluency and pronunciation models for an automatic reading assessment system. This paper describes the acquisition and analysis of a read-aloud speech database of European Portuguese from children aged 6-10, from the first to the fourth school grade. Towards the goal of detecting all reading errors and disfluencies, we apply a decoding process to the utterances using flexible word-level lattices that allow syllable-based false starts and repetitions of two or more word sequences. The proposed method proved promising in detecting corrections and repetitions in sentences, and provides an improved alignment of the data, helpful for future annotation tasks. The analysis of the database also shows agreement with government-defined curricular goals for reading.
Word spotting, or keyword identification, is a highly challenging task when there are multiple speakers speaking simultaneously. In the case of a game being controlled by children solely through voice, the task becomes extremely difficult. Children, unlike adults, typically do not await their turn to speak in an orderly fashion. They interrupt and shout at arbitrary times, speak or say things that are not within the purview of the game vocabulary, arbitrarily stretch, contract, distort or rapid-repeat words, and do not stay in one location either horizontally or vertically. Consequently, standard state-of-the-art keyword spotting systems that work admirably for adults in multiple keyword settings fail to perform well even in a basic two-word vocabulary keyword spotting task in the case of children. This paper highlights the issues with keyword spotting using a simple two-word game played by children of different age groups, and gives quantitative performance assessments using a novel keyword spotting technique that is especially suited to such scenarios.
This paper proposes an age-dependent scheme for automatic height estimation and speaker normalization of children's speech, using the first three subglottal resonances (SGRs). Similar to previous work, our analysis indicates that children above the age of 11 years show different acoustic properties from those under 11. Therefore, an age-dependent model is investigated. The estimation algorithms for the first three SGRs are motivated by our previous research for adults. The algorithms for the first two SGRs have been applied to children's speech before; this paper proposes a similar approach to estimate Sg3 for children. The algorithm is trained and evaluated on 46 children, aged 6-17 years, using cross-validation. Average RMS errors in estimating Sg1, Sg2 and Sg3 using the age-dependent model are 51, 128 and 168 Hz, respectively. The height estimation algorithm exploits the negative correlation between SGRs and height, and the mean absolute height estimation error was found to be less than 3.8 cm for the younger children and 4.9 cm for the older children. In addition, using TIDIGITS, a linear frequency warping scheme using age-dependent Sg3 gives statistically significant word error rate reductions (up to 26%) relative to conventional VTLN.
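A minimal sketch of the statistical core of the height-estimation step: a linear model fitted to (SGR, height) pairs, exploiting the negative correlation between subglottal resonances and height. The synthetic stand-in data and the plain multivariate regression are illustrative assumptions, not the paper's age-dependent model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: Sg1/Sg2/Sg3 (Hz) and height (cm), generated so the
# resonances decrease as height increases (the negative correlation).
rng = np.random.default_rng(0)
height = rng.uniform(110, 175, size=200)                       # cm
sgr = np.column_stack([
    900 - 2.0 * height, 1700 - 3.5 * height, 2600 - 5.0 * height
]) + rng.normal(0, 20, size=(200, 3))                           # Hz, with noise

model = LinearRegression().fit(sgr, height)
pred = model.predict(sgr)
print("mean absolute error (cm):", np.mean(np.abs(pred - height)))
```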
This paper presents a novel approach to robust estimation of linear prediction (LP) model parameters in the application of speech enhancement. The robustness stems from the use of prior knowledge about the clean speech and the interfering noise, which are represented by two separate codebooks of LP model parameters. We propose to model the temporal dependency between short-time model parameters with a composite hidden Markov model (HMM) that is constructed by combining the speech and noise codebooks. Optimal speech model parameters are estimated from the HMM state sequence that best matches the input observation. To further improve the estimation accuracy, we propose to interpolate multiple HMM state sequences so that the estimated speech parameters are not limited by the codebook coverage. Experimental results demonstrate the benefits and effectiveness of temporal dependency modeling and state interpolation in improving the segmental signal-to-noise ratio, PESQ and spectral distortion of the enhanced speech.
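A minimal sketch of the interpolation idea: instead of committing to the single best codebook entry, several candidate LP parameter vectors (here represented as line spectral frequencies) are averaged with weights derived from their log-likelihoods, so the final estimate need not coincide with any codebook entry. The softmax weighting and the LSF representation are illustrative assumptions, not the paper's state-sequence interpolation scheme.

```python
import numpy as np

def interpolate_lp_candidates(candidate_lsf, log_likelihoods):
    """Weighted combination of candidate LP parameters (as LSF vectors).

    candidate_lsf   : (K, P) array, K codebook candidates of LP order P
    log_likelihoods : (K,) array, how well each candidate matches the input
    Returns a (P,) LSF vector that is not restricted to the codebook grid.
    """
    w = np.exp(log_likelihoods - np.max(log_likelihoods))  # softmax weights
    w /= w.sum()
    # Averaging sorted LSF vectors preserves their ordering (hence stability).
    return w @ candidate_lsf

# Illustrative use: three candidate LSF vectors of order 10.
cands = np.sort(np.random.rand(3, 10) * np.pi, axis=1)
ll = np.array([-12.0, -10.5, -11.2])
print(interpolate_lp_candidates(cands, ll))
```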