In this paper, we explore multilingual feature-level data sharing via Deep Neural Network (DNN) stacked bottleneck features. Given a set of available source languages, we apply language identification to pick the language most similar to the target language, for more efficient use of multilingual resources. Our experiments with IARPA-Babel languages show that bottleneck features trained on the most similar source language perform better than those trained on all available source languages. Further analysis suggests that only data similar to the target language is useful for multilingual training.
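As a hedged illustration of the language-selection step (not the authors' implementation), one simple strategy is to run a language-identification model over the untranscribed target-language audio and keep the source language with the highest average posterior; the function and language names below are hypothetical.

```python
# Illustrative sketch (not the paper's exact procedure): pick the source
# language whose LID posterior is, on average, highest for the target data.
import numpy as np

def pick_closest_source(lid_posteriors, source_languages):
    """lid_posteriors: (num_target_utterances, num_source_languages) array of
    language-ID posteriors for target-language utterances; the hypothetical
    `source_languages` list gives the column order."""
    avg_scores = lid_posteriors.mean(axis=0)   # mean posterior per source language
    return source_languages[int(np.argmax(avg_scores))]

# Example with made-up numbers: three candidate source languages, four utterances.
posteriors = np.array([[0.60, 0.30, 0.10],
                       [0.50, 0.40, 0.10],
                       [0.70, 0.20, 0.10],
                       [0.55, 0.35, 0.10]])
print(pick_closest_source(posteriors, ["source_a", "source_b", "source_c"]))
```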
Conventional acoustic models, such as Gaussian mixture models (GMM) or deep neural networks (DNN), cannot be reliably estimated when very little speech training data is available, e.g. less than 1 hour. In this paper, we investigate the use of a non-parametric kernel density estimation method to predict the emission probability of HMM states. In addition, we introduce a discriminative score calibrator to improve the speech class posteriors generated by the kernel density estimator for the speech recognition task. Experimental results on the Wall Street Journal task show that the proposed acoustic model using cross-lingual bottleneck features significantly outperforms GMM and DNN models in the limited training data case.
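The kernel-density idea can be sketched as follows; this is a minimal illustration assuming per-state frame alignments and Gaussian kernels with a hand-set bandwidth, not the paper's exact system, and it omits the discriminative score calibrator.

```python
# Minimal sketch of non-parametric emission modelling: fit one kernel density
# estimate per HMM state on the feature frames aligned to that state, then use
# its log-density as the state emission score. Bandwidth and toy features are
# illustrative.
import numpy as np
from sklearn.neighbors import KernelDensity

def train_kde_states(frames_per_state, bandwidth=0.5):
    # frames_per_state: dict {state_id: (n_frames, dim) array}
    return {s: KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(X)
            for s, X in frames_per_state.items()}

def emission_log_likelihoods(kdes, test_frames):
    # Returns an (n_frames, n_states) matrix of log p(x_t | state).
    states = sorted(kdes)
    return np.column_stack([kdes[s].score_samples(test_frames) for s in states])

# Toy usage with random stand-in "bottleneck features".
rng = np.random.default_rng(0)
kdes = train_kde_states({0: rng.normal(0, 1, (200, 10)),
                         1: rng.normal(1, 1, (200, 10))})
print(emission_log_likelihoods(kdes, rng.normal(0, 1, (5, 10))).shape)  # (5, 2)
```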
This paper presents our latest investigation of automatic speech recognition (ASR) on non-native speech. We first report on a non-native speech corpus — an extension of the GlobalPhone database — which contains English with Bulgarian, Chinese, German and Indian accents and German with Chinese accent. In this case, English is the spoken language (L2) and Bulgarian, Chinese, German and Indian are the mother tongues (L1) of the speakers. Afterwards, we investigate the effect of multilingual acoustic modeling on non-native speech. Our results reveal that a bilingual L1-L2 acoustic model significantly improves the ASR performance on non-native speech. For the case that L1 is unknown or L1 data is not available, a multilingual ASR system trained without L1 speech data consistently outperforms the monolingual L2 ASR system. Finally, we propose a method called crosslingual accent adaptation, which allows using English with Chinese accent to improve the German ASR on German with Chinese accent and vice versa. Without using any intra-lingual adaptation data, we achieve a 15.8% relative improvement on average over the baseline system.
Developing high-performance speech processing systems for low-resource languages is very challenging. One approach to address the lack of resources is to make use of data from multiple languages. A popular direction in recent years is to train a multi-language bottleneck DNN. Language dependent and/or multi-language (all training languages) Tandem acoustic models (AM) are then trained. This work considers a particular scenario where the target language is unseen in multi-language training and has limited language model training data, a limited lexicon, and acoustic training data without transcriptions. A zero acoustic resources case is first described, where a multi-language AM is directly applied to an unseen language as a language independent AM (LIAM). Secondly, in an unsupervised approach, a LIAM is used to obtain hypothesised transcriptions for the target language acoustic data, which are then used to train a language dependent AM. Three languages from the IARPA Babel project are used for assessment: Vietnamese, Haitian Creole and Bengali. Performance of the zero acoustic resources system is found to be poor, with keyword spotting at best 60% of language dependent performance. Unsupervised language dependent training yields performance gains. For one language (Haitian Creole) the Babel target is achieved on the in-vocabulary data.
Posterior-based or bottleneck features derived from neural networks trained on out-of-domain data may be successfully applied to improve speech recognition performance when data is scarce for the target domain or language. In this paper we combine this approach with the use of a hierarchical deep neural network (DNN) structure — which we term a multi-level adaptive network (MLAN) — and the use of multitask learning. We have applied the technique to cross-lingual speech recognition experiments on recordings of TED talks and European Parliament sessions in English (source language) and German (target language). We demonstrate that the proposed method can lead to improvements over standard methods, even when the quantity of training data for the target language is relatively high. When the complete method is applied, we achieve relative WER reductions of around 13% compared to a monolingual hybrid DNN baseline.
Despite various advances in automatic speech recognition (ASR) technology, recognition of speech uttered by non-native speakers is still a challenging problem. In this paper, we investigate the role of different factors such as type of lexical model and choice of acoustic units in recognition of speech uttered by non-native speakers. More precisely, we investigate the influence of the probabilistic lexical model in the framework of the Kullback-Leibler divergence based hidden Markov model (KL-HMM) approach in handling pronunciation variabilities by comparing it against the hybrid HMM/artificial neural network (ANN) approach, where the lexical model is deterministic. Moreover, we study the effect of acoustic units (being context-independent or clustered context-dependent phones) on ASR performance in both KL-HMM and hybrid HMM/ANN frameworks. Our experimental studies on the French part of the bilingual MediaParl corpus indicate that the probabilistic lexical modeling approach in the KL-HMM framework can capture the pronunciation variations present in non-native speech effectively. More precisely, the experimental results show that the KL-HMM system using context-dependent acoustic units and trained solely on native speech data can lead to better ASR performance than adaptation techniques such as maximum likelihood linear regression.
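For orientation, a common formulation of the KL-HMM local score compares a trained categorical (multinomial) distribution per state with the ANN posterior of each frame via KL divergence; the sketch below assumes that standard formulation rather than the paper's specific variant.

```python
# Sketch of a KL-HMM frame-level local score under a common formulation (an
# assumption, not necessarily the exact variant used in the paper): each state
# holds a categorical distribution over phone classes, and the score is the KL
# divergence between that distribution and the ANN posterior for the frame.
import numpy as np

def kl_local_score(state_dist, ann_posterior, eps=1e-10):
    """KL(state_dist || ann_posterior); lower means a better match."""
    p = np.clip(state_dist, eps, None)
    q = np.clip(ann_posterior, eps, None)
    return float(np.sum(p * np.log(p / q)))

state = np.array([0.7, 0.2, 0.1])   # trained multinomial for one KL-HMM state
frame = np.array([0.6, 0.3, 0.1])   # ANN posterior for one frame
print(kl_local_score(state, frame))
```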
We compare the performance of Directional Derivatives features for automatic speech recognition when extracted from different time-frequency representations. Specifically, we use the short-time Fourier transform, Mel-frequency, and Gammatone spectrograms as a base from which we extract spectro-temporal modulations. We then assess the noise robustness of each representation with varied numbers of frequency bins and dynamic range compression schemes for both word and phone recognition. We find that the choice of dynamic range compression approach has the most significant impact on recognition performance, whereas the performance differences between perceptually motivated filter-banks are minimal in the proposed framework. Furthermore, this work presents significant gains in speech recognition accuracy for low SNRs over MFCCs, GFCCs, and Directional Derivatives extracted from the log-Mel spectrogram.
We propose a signal pre-processing front-end to enhance speech based on deep neural networks (DNNs) and use the enhanced speech features directly to train hidden Markov models (HMMs) for robust speech recognition. As a comprehensive study, we examine its effectiveness for different acoustic features, acoustic models, and training-testing combinations. Tested on the Aurora4 task, the experimental results indicate that our proposed framework consistently outperforms the state-of-the-art speech recognition systems in all evaluation conditions. To the best of our knowledge, this is the first showcase on the Aurora4 task yielding performance gains by using only an enhancement pre-processor without any adaptation or compensation post-processing on top of the best DNN-HMM system. The word error rate reduction from the baseline system is up to 50% for clean-condition training and 15% for multi-condition training. We believe the system performance could be improved further by incorporating post-processing techniques to work coherently with the proposed enhancement pre-processing scheme.
Noise-robust automatic speech recognition (ASR) systems rely on feature and/or model compensation. Existing compensation techniques typically operate on the features or on the parameters of the acoustic models themselves. By contrast, a number of normalization techniques have been defined in the field of speaker verification that operate on the resulting log-likelihood scores. In this paper, we provide a theoretical motivation for likelihood normalization due to the so-called “hubness” phenomenon and we evaluate the benefit of several normalization techniques on ASR accuracy for the 2nd CHiME Challenge task. We show that symmetric normalization (S-norm) reduces the relative error rate by 43% alone and by 10% after feature and model compensation.
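As a hedged sketch, the symmetric normalization referred to here follows the speaker-verification definition: a raw score is z-normalized against two cohorts of scores and the two results are averaged. How the cohorts would be built for ASR log-likelihoods is an assumption in this illustration.

```python
# Illustration of symmetric normalization (S-norm) as defined in speaker
# verification, which the paper transfers to acoustic log-likelihood scores:
# the score is normalized by the mean/std of two cohort score sets and the two
# normalized values are averaged. Cohort construction here is an assumption.
import numpy as np

def s_norm(score, cohort_scores_a, cohort_scores_b):
    za = (score - np.mean(cohort_scores_a)) / np.std(cohort_scores_a)
    zb = (score - np.mean(cohort_scores_b)) / np.std(cohort_scores_b)
    return 0.5 * (za + zb)

# Toy example: normalize one log-likelihood against two cohorts of scores.
print(s_norm(-42.0,
             np.array([-50.0, -48.0, -55.0]),
             np.array([-40.0, -44.0, -46.0])))
```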
Previous comparisons of human speech recognition (HSR) and automatic speech recognition (ASR) focused on monaural signals in additive noise, and showed that HSR is far more robust against intrinsic and extrinsic sources of variation than conventional ASR. The aim of this study is to analyze the man-machine gap (and its causes) in more complex acoustic scenarios, particularly in scenes with two moving speakers, reverberation and diffuse noise. Responses of nine normal-hearing listeners are compared to errors of an ASR system that employs a binaural model for direction-of-arrival estimation and beamforming for signal enhancement. The overall man-machine gap is measured in terms of the speech recognition threshold (SRT), i.e., the signal-to-noise ratio at which a 50% recognition rate is obtained. The comparison shows that the gap amounts to a 16.7 dB SRT difference, which exceeds the difference of 10 dB found in monaural situations. Based on cross comparisons that use oracle knowledge (e.g., the speakers' true position), incorrect responses are attributed to localization errors (7 dB) or to missing spectral information for distinguishing between speakers of different gender (3 dB). The comparison hence identifies specific ASR components that can profit from learning from binaural auditory signal processing.
One method to achieve robust speech recognition in adverse conditions including noise and reverberation is to employ acoustic modelling techniques involving neural networks. Using long short-term memory (LSTM) recurrent neural networks proved to be efficient for this task in a setup for phoneme prediction in a multi-stream GMM-HMM framework. These networks exploit a self-learnt amount of temporal context, which makes them especially suited for a noisy speech recognition task. One shortcoming of this approach is the necessity of a GMM acoustic model in the multi-stream framework. Furthermore, potential modelling power of the network is lost when predicting phonemes, compared to the classical hybrid setup where the network predicts HMM states. In this work, we propose to use LSTM networks in a hybrid HMM setup, in order to overcome these drawbacks. Experiments are performed using the medium-vocabulary recognition track of the 2nd CHiME challenge, containing speech utterances in a reverberant and noisy environment. A comparison of different network topologies for phoneme or state prediction used either in the hybrid or double-stream setup shows that state prediction networks perform better than networks predicting phonemes, leading to state-of-the-art results for this database.
Context-dependent deep neural networks (DNNs) have obtained consistent and significant improvements over Gaussian mixture model (GMM) based systems for various speech recognition tasks. However, since DNNs are discriminatively trained, they are more sensitive to label errors and are not reliable for unsupervised adaptation. Moreover, DNN parameters do not have a clear and meaningful interpretation; therefore, it has been difficult to develop effective adaptation techniques for DNNs. Nevertheless, unadapted multi-style trained DNNs have already shown superior performance to GMM systems with joint noise/speaker adaptation and adaptive training. Recently, Temporally Varying Weight Regression (TVWR) has been successfully applied to combine DNN and GMM for robust unsupervised speaker adaptation. In this paper, joint speaker/noise adaptation and adaptive training of TVWR using DNN posteriors are investigated for robust speech recognition. Experimental results on the Aurora 4 corpus showed that, after joint adaptation and adaptive training, TVWR achieved 21.3% and 11.6% relative improvements over the DNN baseline system and the best system reported in the literature to date, respectively.
The precedence effect describes the ability of the auditory system to suppress the later-arriving components of sound in a reverberant environment, maintaining the perceived arrival azimuth of a sound in the direction of the actual source, even though later reverberant components may arrive from other directions. It is widely believed that precedence-like processing can also improve speech intelligibility, as well as the accuracy of speech recognition systems, in reverberant environments. While the mechanisms underlying the precedence effect have traditionally been assumed to be binaural in nature, it is also possible that the suppression of later components may take place monaurally, and that the suppression of the later-arriving components of the spatial image may be a consequence of this more peripheral processing. This paper compares the potential contributions of onset enhancement (and consequent steady-state suppression) of the envelopes of subband components of speech at both the monaural and binaural levels. Experimental results indicate that substantial improvement in recognition accuracy can be obtained in reverberant environments if the feature extraction includes both onset enhancement and binaural interaction. Recognition accuracy appears to be relatively unaffected by the stage in the suppression processing at which the binaural interaction takes place.
In this paper, we propose the variable-component DNN (VCDNN) to improve the robustness of the context-dependent deep neural network hidden Markov model (CD-DNN-HMM). This method is inspired by the variable-parameter HMM (VPHMM), in which the variation of model parameters is modeled as a set of polynomial functions of the environmental signal-to-noise ratio (SNR), and during testing the model parameters are recomputed according to the estimated testing SNR. In VCDNN, we refine two types of DNN components: (1) the weight matrix and bias, and (2) the output of each layer. Experimental results on the Aurora4 task show that VCDNN achieved 6.53% and 5.92% relative word error rate reduction (WERR) over the standard DNN for the two methods, respectively. Under unseen SNR conditions, VCDNN gave even better results (8.46% relative WERR for the DNN with varying matrix and bias, 7.08% relative WERR for the DNN with varying layer output). Moreover, VCDNN with 1024 units per hidden layer outperforms the standard DNN with 2048 units per hidden layer by 3.22% relative WERR at roughly half the computational and memory cost, showing superior ability to produce sharper and more compact models.
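The variable-component idea can be illustrated as follows: layer parameters are stored as polynomial coefficients in SNR and re-synthesized at test time from the estimated SNR. The polynomial degree, shapes, and the use of raw SNR in dB below are illustrative assumptions, not the paper's configuration.

```python
# Sketch of variable components: a layer's weight matrix and bias are
# polynomials in the (estimated) SNR, so the actual parameters used at test
# time are re-synthesized from the polynomial coefficients.
import numpy as np

def synthesize_params(weight_coeffs, bias_coeffs, snr_db):
    """weight_coeffs: list of K+1 matrices W_k; bias_coeffs: list of K+1 vectors b_k.
    Returns W(snr) = sum_k W_k * snr^k and b(snr) = sum_k b_k * snr^k."""
    powers = [snr_db ** k for k in range(len(weight_coeffs))]
    W = sum(p * Wk for p, Wk in zip(powers, weight_coeffs))
    b = sum(p * bk for p, bk in zip(powers, bias_coeffs))
    return W, b

rng = np.random.default_rng(1)
W_poly = [rng.normal(size=(4, 3)) for _ in range(3)]   # degree-2 polynomial in SNR
b_poly = [rng.normal(size=4) for _ in range(3)]
W, b = synthesize_params(W_poly, b_poly, snr_db=10.0)
print(W.shape, b.shape)                                # (4, 3) (4,)
```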
Modulation spectrum processing of acoustic features has received considerable attention in the area of robust speech recognition because of its relative simplicity and good empirical performance. An emerging school of thought is to conduct nonnegative matrix factorization (NMF) on the modulation spectrum domain so as to distill intrinsic and noise-invariant temporal structure characteristics of acoustic features for better robustness. This paper presents a continuation of this general line of research and its main contribution is two-fold. One is to explore the notion of sparsity for NMF so as to ensure the derived basis vectors have sparser and more localized representations of the modulation spectra. The other is to investigate a novel cluster-based NMF processing, in which speech utterances belonging to different clusters will have their own set of cluster-specific basis vectors. As such, the speech utterances can retain more discriminative information in the NMF processed modulation spectra. All experiments were conducted on the Aurora-2 corpus and task. Empirical evidence reveals that our methods can offer substantial improvements over the baseline NMF method and achieve performance competitive to or better than several widely-used robustness methods.
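A minimal sketch of sparsity-regularized NMF on nonnegative modulation-spectrum data is given below, using plain multiplicative updates with an L1 penalty on the basis matrix; the cluster-specific basis sets described in the abstract are not shown, and the penalty weight is arbitrary.

```python
# Minimal sketch (not the paper's algorithm) of NMF with an L1 sparsity
# penalty on the basis matrix, using multiplicative updates on a nonnegative
# "modulation spectrum" matrix V ~ W H.
import numpy as np

def sparse_nmf(V, rank, sparsity=0.1, iters=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # The L1 term in the denominator pushes basis vectors towards sparser,
        # more localized shapes.
        W *= (V @ H.T) / (W @ H @ H.T + sparsity + eps)
    return W, H

V = np.abs(np.random.default_rng(2).normal(size=(64, 100)))   # toy modulation spectra
W, H = sparse_nmf(V, rank=8)
print(W.shape, H.shape)                                       # (64, 8) (8, 100)
```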
Tandem systems based on multi-layer perceptrons (MLPs) have improved the performance of automatic speech recognition systems on both large vocabulary and noisy tasks. One potential problem of the standard Tandem approach, however, is that the MLPs generally used do not model temporal dynamics inherent in speech. In this work, we propose a hybrid MLP/Structured-SVM model, in which the parameters between the hidden layer and output layer and temporal transitions between output layers are modeled by a Structured-SVM. A Structured-SVM can be thought of as an extension to the classical binary support vector machine which can naturally classify “structures” such as sequences. Using this approach, we can identify sequences of phones in an utterance. We try this model on two different corpora — Aurora2 and the large-vocabulary section of the ICSI meeting corpus — to investigate the model's performance in noisy conditions and on a large-vocabulary task. Compared to a difficult Tandem baseline in which the MLP is trained using 2nd-order optimization methods, the MLP/Structured-SVM system decreases WER in noisy conditions by 7.9% relative. On the large vocabulary corpus, the proposed system decreases WER by 1.1% absolute compared to the 2nd-order Tandem system.
In this paper, we present a new dereverberation algorithm called Temporal Masking and Thresholding (TMT) to enhance the temporal spectra of spectral features for robust speech recognition in reverberant environments. This algorithm is motivated by the precedence effect and temporal masking of human auditory perception. This work is an improvement of our previous dereverberation work called Suppression of Slowly-varying components and the falling edge of the power envelope (SSF). The TMT algorithm uses a different mathematical model to characterize temporal masking and thresholding than the model used in the SSF algorithm. Specifically, the nonlinear highpass filtering used in the SSF algorithm has been replaced by a masking mechanism based on a combination of peak detection and dynamic thresholding. Speech recognition results show that the TMT algorithm provides superior recognition accuracy compared to other algorithms such as LTLSS, VTS, or SSF in reverberant environments.
Recently deep neural networks (DNNs) have become increasingly popular for acoustic modelling in automatic speech recognition (ASR) systems. As the bottleneck features they produce are inherently discriminative and contain rich hidden factors that influence the surface acoustic realization, the standard approach is to augment the conventional acoustic features with the bottleneck features in a tandem framework. In this paper, an alternative approach to incorporate bottleneck features is investigated. The complex relationship between acoustic features and DNN bottleneck features is modelled using generalized variable parameter HMMs (GVP-HMMs). The optimal GVP-HMM structural configuration and model parameters are automatically learnt. Significant error rate reductions of 48% and 8% relative were obtained over the baseline multi-style HMM and tandem HMM systems respectively on Aurora 2.
The model compensation approach has been successfully applied to various noise-robust speech recognition tasks. In this paper, based on the Continuous Time (CT) approximation, the dynamic mismatch function is derived without further approximation. With such a mismatch function, a novel approach to deriving the formula for calculating the dynamic statistics is presented. In addition, we provide insight into the handling of the pseudo-inverse of the non-square discrete cosine transform (DCT) matrix during model compensation. Experiments on Aurora 4 showed that the proposed approach obtained a 23.2% relative WER reduction over the traditional first-order Vector Taylor Series (VTS) approach.
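For background, the widely used static VTS mismatch function that such derivations start from is reproduced below (with the pseudo-inverse standing in for the inverse of the non-square DCT matrix); the paper's dynamic mismatch function itself is not reproduced here.

```latex
% Static VTS mismatch in the cepstral domain (standard background, not the
% paper's new dynamic derivation): y noisy speech, x clean speech, n additive
% noise, h channel; C is the (non-square) DCT matrix and C^{+} its pseudo-inverse.
y = x + h + C \log\!\left( 1 + \exp\!\left( C^{+} (n - x - h) \right) \right)
```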
We consider a speech recognition method for mixed sound composed of speech and music, in which only the music is removed using non-negative matrix factorization (NMF). We compared the Itakura-Saito divergence against the Kullback-Leibler divergence as the cost function, and applied dynamics and sparseness constraints on the weight matrix to improve speech recognition. For isolated word recognition using the matched condition model, we reduced the word error rate by 52.1% relative compared to the case in which music was not removed (on average, word accuracy improved from 69.3% to 85.3%).
In conventional VTS-based noisy speech recognition methods, the parameters of the clean speech HMM are adapted to the test noisy speech, or the original clean speech is estimated from the test noisy speech. However, in noisy speech recognition, improved performance is generally expected when employing noisy acoustic models produced by methods such as multi-condition training (MTR) and the multi-model based speech recognition (MMSR) framework, compared with using clean HMMs. Motivated by this idea, a method was previously developed that can make use of the noisy acoustic models in the VTS algorithm, where additive noise is adapted for speech feature compensation. In this paper, we modify the previous method to adapt channel noise as well as additive noise. The proposed method was applied to noise-adapted HMMs trained by MTR and MMSR and reduced the word error rate by 6.5% and 7.2% relative, respectively, in noisy speech recognition experiments on the Aurora 2 database.
This work presents a noise spectrum estimator based on the Gaussian mixture model (GMM)-based speech presence probability (SPP) for robust speech recognition. The estimated noise spectrum is then used to compute a subband a posteriori signal-to-noise ratio (SNR). A sigmoid-shaped weighting rule is formed based on this subband a posteriori SNR to enhance the speech spectrum in the auditory domain, which is used in the Mel-frequency cepstral coefficient (MFCC) framework for robust feature extraction, denoted here as Robust MFCC (RMFCC). The performance of the GMM-SPP noise spectrum estimator-based RMFCC feature extractor is evaluated in the context of speech recognition on the AURORA-4 continuous speech recognition task. For comparison we incorporate six existing noise estimation methods into this auditory domain spectrum enhancement framework. The ETSI advanced front-end (ETSI-AFE), power normalized cepstral coefficients (PNCC), and robust compressive gammachirp cepstral coefficients (RCGCC) are also considered for comparison purposes. Experimental speech recognition results show that, in terms of word accuracy, RMFCC provides average relative improvements of 8.1%, 6.9% and 6.6% over RCGCC, ETSI-AFE, and PNCC, respectively. With the GMM-SPP-based noise estimation method, an average relative improvement of 3.6% is obtained over the other six noise estimation methods in terms of word recognition accuracy.
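A hedged sketch of a sigmoid-shaped weighting rule driven by the subband a posteriori SNR is shown below; the slope and midpoint are illustrative rather than the paper's tuned values, and the noise powers are assumed to come from a noise estimator such as the GMM-SPP one described above.

```python
# Sketch of a sigmoid weighting rule: each auditory (e.g. mel) band gets a
# gain between 0 and 1 driven by its a posteriori SNR, suppressing low-SNR
# bands before MFCC extraction. Parameter values are illustrative.
import numpy as np

def sigmoid_gain(post_snr_db, slope=0.5, midpoint_db=5.0):
    return 1.0 / (1.0 + np.exp(-slope * (post_snr_db - midpoint_db)))

def enhance_subbands(noisy_power, noise_power):
    post_snr_db = 10.0 * np.log10(noisy_power / np.maximum(noise_power, 1e-12))
    return sigmoid_gain(post_snr_db) * noisy_power

noisy = np.array([2.0, 5.0, 0.5, 8.0])   # toy subband powers of noisy speech
noise = np.array([1.0, 1.0, 1.0, 1.0])   # estimated subband noise powers
print(enhance_subbands(noisy, noise))
```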
When deployed in automated speech recognition (ASR), deep neural networks (DNNs) can be treated as a complex feature extractor plus a simple linear classifier. Previous work has investigated the utility of multilingual DNNs acting as language-universal feature extractors (LUFEs). In this paper, we explore different strategies to further improve LUFEs. First, we replace the standard sigmoid nonlinearity with the recently proposed maxout units. The resulting maxout LUFEs have the nice property of generating sparse feature representations. Second, the convolutional neural network (CNN) architecture is applied to obtain a more invariant feature space. We evaluate the performance of LUFEs on a cross-language ASR task. Each of the proposed techniques results in word error rate reductions compared with the existing DNN-based LUFEs. Combining the two methods brings additional improvement on the target language.
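A maxout unit itself is simple to state: each output is the maximum over a small group of linear projections. The sketch below is a generic illustration of that nonlinearity (group size and dimensions are arbitrary), not the paper's network.

```python
# Generic maxout unit: the output is the element-wise maximum over a group of
# linear projections, replacing a sigmoid nonlinearity.
import numpy as np

def maxout(x, W, b, group_size=3):
    """W: (in_dim, out_dim * group_size), b: matching bias; returns (out_dim,)."""
    z = x @ W + b
    return z.reshape(-1, group_size).max(axis=1)

rng = np.random.default_rng(3)
x = rng.normal(size=40)              # e.g. one frame of acoustic features
W = rng.normal(size=(40, 10 * 3))    # 10 maxout units, groups of 3 linear pieces
b = rng.normal(size=10 * 3)
print(maxout(x, W, b).shape)         # (10,)
```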
A traditional framework in speech production describes the output speech as an interaction between a source excitation and a vocal-tract configured by the speaker to impart segmental characteristics. In general, this simplification has led to approaches where systems that focus on phonetic segment tasks (e.g. speech recognition) make use of a front-end that extracts features that aim to distinguish between different vocal-tract configurations. The excitation signal, on the other hand, has received more attention for speaker-characterization tasks. In this work we look at augmenting the front-end in a recognition system with vocal-source features, motivated by our work with languages that are low in resources and whose phonology and phonetics suggest the need for complementary approaches to classical ASR features. We demonstrate that the additional use of such features provides improvements over a state-of-the-art system for low-resource languages from the BABEL Program.
Recently there has been interest in approaches for training speech recognition systems for languages with limited resources. Under the IARPA Babel program such resources have been provided for a range of languages to support this research area. This paper examines a particular form of approach, data augmentation, that can be applied to these situations. Data augmentation schemes aim to increase the quantity of data available to train the system, for example through semi-supervised training, multilingual processing, acoustic data perturbation and speech synthesis. To date the majority of work has considered individual data augmentation schemes, with few consistent performance contrasts or examination of whether the schemes are complementary. In this work two data augmentation schemes, semi-supervised training and vocal tract length perturbation, are examined and combined on the Babel limited language pack configuration, where only about 10 hours of transcribed acoustic data are available. Two languages are examined, Assamese and Zulu, which were found to be the most challenging of the Babel languages released for the 2014 Evaluation. For both languages consistent speech recognition performance gains can be obtained using these augmentation schemes. Furthermore, the impact of these performance gains on a down-stream keyword spotting task is also described.
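As an illustration of the vocal tract length perturbation scheme mentioned here, one common recipe resamples each utterance's spectrogram along the frequency axis with a random warp factor near 1.0; the simple linear warp below (with out-of-range bins clipped) is an assumption, not the exact Babel-system implementation.

```python
# Simplified VTLP-style augmentation sketch: stretch or compress the frequency
# axis of a spectrogram by a warp factor alpha, interpolating each frame.
import numpy as np

def vtlp_warp(spectrogram, alpha):
    """spectrogram: (num_bins, num_frames); alpha typically drawn near 1.0,
    e.g. uniformly from [0.9, 1.1]."""
    num_bins = spectrogram.shape[0]
    bins = np.arange(num_bins)
    warped_bins = np.clip(bins * alpha, 0, num_bins - 1)
    # Interpolate each frame at the warped frequency positions.
    return np.stack([np.interp(warped_bins, bins, frame)
                     for frame in spectrogram.T], axis=1)

spec = np.abs(np.random.default_rng(4).normal(size=(257, 50)))   # toy spectrogram
print(vtlp_warp(spec, alpha=0.95).shape)                         # (257, 50)
```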