Total: 89

Recently, we proposed and developed the context-dependent deep neural network hidden Markov models (CD-DNN-HMMs) for large vocabulary speech recognition and achieved highly promising recognition results including over one third fewer word errors than the discriminatively trained, conventional HMM-based systems on the 300hr Switchboard benchmark task. In this paper, we extend DNNs to deep tensor neural networks (DTNNs) in which one or more layers are double-projection and tensor layers. The basic idea of the DTNN comes from our realization that many factors interact with each other to predict the output. To represent these interactions, we project the input to two nonlinear subspaces through the double-projection layer and model the interactions between these two subspaces and the output neurons through a tensor with three-way connections. Evaluation on 30hr Switchboard task indicates that DTNNs can outperform DNNs with similar number of parameters with 5% relative word error reduction.

Training neural network acoustic models with sequence-discriminative criteria, such as state-level minimum Bayes risk (sMBR), been shown to produce large improvements in performance over cross-entropy. However, because they entail the processing of lattices, sequence criteria are much more computationally intensive than cross-entropy. We describe a distributed neural network training algorithm, based on Hessian-free optimization, that scales to deep networks and large data sets. For the sMBR criterion, this training algorithm is faster than stochastic gradient descent by a factor of 5.5 and yields a 4.4% relative improvement in word error rate on a 50-hour broadcast news task. Distributed Hessianfree sMBR training yields relative reductions in word error rate of 7-13% over cross-entropy training with stochastic gradient descent on two larger tasks: Switchboard and DARPA RATS noisy Levantine Arabic. Our best Switchboard DBN achieves a word error rate of 16.4% on rt03-FSH.

We present a deep neural network (DNN) architecture which learns time-dependent offsets to acoustic feature vectors according to a discriminative objective function such as maximum mutual information (MMI) between the reference words and the transformed acoustic observation sequence. A key ingredient in this technique is a greedy layer-wise pretraining of the network based on minimum squared error between the DNN outputs and the offsets provided by a linear feature-space MMI (FMMI) transform. Next, the weights of the pretrained network are updated with stochastic gradient ascent by backpropagating the MMI gradient through the DNN layers. Experiments on a 50 hour English broadcast news transcription task show a 4% relative improvement using a 6-layer DNN transform over a state-of-the-art speakeradapted system with FMMI and model-space discriminative training.

Gaussian Mixture Model (GMM) and Multi Layer Perceptron (MLP) based acoustic models are compared on a French large vocabulary continuous speech recognition (LVCSR) task. In addition to optimizing the output layer size of the MLP, the ef- fect of the deep neural network structure is also investigated. Moreover, using different linear transformations (time deriva- tives, LDA, CMLLR) on conventional MFCC, the study is also extended to MLP based probabilistic and bottle-neck TANDEM features. Results show that using either the hybrid or bottle- neck TANDEM approach leads to similar recognition perfor- mance. However, the best performance is achieved when deep MLP acoustic models are trained on concatenated cepstral and context-dependent bottle-neck features. Further experiments re- veal the importance of the neighbouring frames in case of MLP based modeling, and that its gain over GMM acoustic models is strongly reduced by more complex features.

Recent work on deep neural networks as acoustic models for automatic speech recognition (ASR) have demonstrated substantial performance improvements. We introduce a model which uses a deep recurrent auto encoder neural network to denoise input features for robust ASR. The model is trained on stereo (noisy and clean) audio features to predict clean features given noisy input. The model makes no assumptions about how noise affects the signal, nor the existence of distinct noise environments. Instead, the model can learn to model any type of distortion or additive noise given sufficient training data. We demonstrate the model is competitive with existing feature denoising approaches on the Aurora2 task, and outperforms a tandem approach where deep networks are used to predict phoneme posteriors directly.

The Context-Dependent Deep-Neural-Network HMM, or CDDNN-HMM, is a recently proposed acoustic-modeling technique for HMM-based speech recognition that can greatly outperform conventional Gaussian-mixture based HMMs. For example, a CD-DNN-HMM trained on the 2000h Fisher corpus achieves 14.4% word error rate on the Hub5'00-FSH speakerindependent phone-call transcription task, compared to 19.6% obtained by a state-of-the-art, conventional discriminatively trained GMM-based HMM. That CD-DNN-HMM, however, took 59 days to train on a modern GPGPU — the immense computational cost of the minibatch based back-propagation (BP) training is a major roadblock. Unlike the familiar Baum-Welch training for conventional HMMs, BP cannot be efficiently parallelized across data. In this paper we show that the pipelined approximation to BP, which parallelizes computation with respect to layers, is an efficient way of utilizing multiple GPGPU cards in a single server. Using 2 and 4 GPGPUs, we achieve a 1.9 and 3.3 times end-to-end speed-up, at parallelization efficiency of 0.95 and 0.82, respectively, at no loss of recognition accuracy.

We propose a novel approach to acoustic modeling based on recent advances in sparse representations. The key idea in sparse coding is to compute a compressed local representation of a signal via an over-complete basis or dictionary that is learned in an unsupervised way. In this study, we compute the local representation on speech spectrogram as the raw “signal” and use it as the local sparse code to perform a standard phone classification task. A linear classifier is used that directly receives the coding space for making the classification decision. The simplicity of the linear classifier allows us to assess whether the sparse representations are sufficiently rich to serve as effective acoustic features for discriminating speech classes. Our experiments demonstrate competitive error rates when compared to other shallow approaches. An examination of the dictionary learned in sparse feature extraction demonstrates meaningful acoustic-phonetic properties that are captured by a collection of the dictionary entries.

In the state-of-the-art automatic speech recognition (ASR) systems, adaption techniques are used to the mitigate performance degradation caused by the mismatch in the training and testing procedure. Although there are bunch of adaption techniques for the hidden Markov models (HMM)-GMM-based system, there is rare work about the adaption in the hybrid artificial neural network~(ANN)/HMM-based system. Recently, there is a resurgence on ANN/HMM scheme for ASR with the success of context dependent deep neural network HMM~(CD-DNN/ HMM). Therefore in this paper, we present our initial efforts on the adaption techniques in the CD-DNN/HMM system. Specially, a linear input network(LIN)-based method and a neural network retraining(NNR)-based method is experimentally explored for the the task-adaptation purpose. Experiments on conversation telephone speech data set shows that these techniques can improve the system significantly and LINbased method seems to work better with medium mount of adaptation data.

The use of Deep Belief Networks (DBN) to pretrain Neural Networks has recently led to a resurgence in the use of Artificial Neural Network - Hidden Markov Model (ANN/HMM) hybrid systems for Automatic Speech Recognition (ASR). In this paper we report results of a DBN-pretrained contextdependent ANN/HMM system trained on two datasets that are much larger than any reported previously with DBN-pretrained ANN/HMM systems - 5870 hours of Voice Search and 1400 hours of YouTube data. On the first dataset, the pretrained ANN/HMM system outperforms the best Gaussian Mixture Model - Hidden Markov Model (GMM/HMM) baseline, built with a much larger dataset by 3.7% absolute WER, while on the second dataset, it outperforms the GMM/HMM baseline by 2.9% absolute. Maximum Mutual Information (MMI) fine tuning and model combination using Segmental Conditional Random Fields (SCARF) give additional gains of 0.1% and 0.4% on the first dataset and 0.6% and 1.1% absolute on the second dataset.

Recently there has been some interest in the question of how to build LVCSR systems for the low-resource languages. The scenario we focus on here is having only one hour of acoustic training data in the "target" language, but more plentiful data in other languages. This paper presents approaches using MLP based features: we construct a low-resource system with additional sources of information from the non-target languages to train the cross-lingual MLPs. A hierarchical architecture and multi-stream strategy are applied on the cross-lingual phone level, to improve the neural network more discriminatively. Additionally, an elaborate ensemble system with various acoustic feature streams and context expansion lengths is proposed. After system combination with these two strategies we get significant improvements of more than 8% absolute versus a conventional baseline in this low-resource scenario with only one hour of target training data.

In this paper we present our latest investigation on initialization schemes for Multilayer Perceptron (MLP) training using multilingual data. We show that the overall performance of an MLP network improves significantly by initializing it with a multilingual MLP. We propose a new strategy called "open target language" MLP to train more flexible models for language adaptation, which is particularly suited for small amounts of training data. Furthermore, by applying Bottle-Neck feature (BN) initialized with multilingual MLP the ASR performance increases on both, on those languages which were used for multilingual MLP training, and on a new language. Our experiments show word error rate improvements of up to 16.9% relative on a range of tasks for different target languages (Creole and Vietnamese) with manually and automatically transcribed training data.

This work is concerned with speaker adaptation techniques for artificial neural network (ANN) implemented as feed forward multi-layer perceptrons (MLPs) in the context of large vocabulary continuous speech recognition (LVCSR). Most successful speaker adaptation techniques for MLPs consist of augmenting the neural architecture with a linear transformation network connected to either the input or the output layer. The weights of this additional linear layer are learned during the adaptation phase while all of the other weights are kept frozen in order to avoid over-fitting. In doing so, the structure of the speaker-dependent (SD) and speaker-independent (SI) architecture differs and the number of adaptation parameters depends upon the dimension of either the input or output layers. We propose a more flexible neural architecture for speaker-adaptation to overcome the limits of current approaches. This flexibility is achieved by adopting hidden activation functions that can be learned directly from the adaptation data. This adaptive capability of the hidden activation function is achieved through the use of orthonormal Hermite polynomials. Experimental evidence gathered on the Nov92 task demonstrates the viability of the proposed technique.

Recently, deep neural networks have been collecting attention of speech researchers due to its capability of handling nonlinearity in speech feature vectors. On the other hand, speech recognition based on structured classification is also considered important since it successfully exploits interdependency of several information sources. In this paper, we focus on the structured classification method based on weighted finite-state transducers (WFSTs) that introduces linear classification term for each arc transition cost in decoding network to capture contextural information of labels. Since these two approaches attempt to improve representation of features and labels, respectively, the combination of these models would be efficient because of complementarity. Thus, this paper proposes a method that combines deep neural network techniques with WFST-based structured classification approaches. In the proposed method, DNNs are used to extract classification friendly features; and then, the features are classified by using WFST-based structured classifiers. The proposed method is evaluated by using TIMIT continuous phoneme recognition tasks. We confirmed that combining structured classification leads to stable performance improvements even from the well-optimized deep neural network acoustic models.

The Deep stacking network (DSN) is a special type of deep architecture developed to enable parallel learning of its weight parameters distributed over large CPU clusters. This capability of DSN in learning parallelism is unique among all deep models explored so far. As a prospective key component of next-generation speech recognizers, the architectural design of the DSN and its parallel learning enable DSNfs scalability over a potentially unlimited amount of training data and over CPU clusters. In this paper, we present our first parallel implementation of the DSN learning algorithm. Particularly, we show the tradeoff between the time/memory saving via a high degree of parallelism and the associated cost arising from inter-CPU communication. In addition, in phone classification experiments, we demonstrate a significantly lowered error rate achieved by DSN with full-batch training, which is enabled by parallel implementation in a CPU cluster, than with the corresponding mini-batch training exploited prior to the work reported in this paper.

Large vocabulary continuous speech recognition is particularly difficult for low-resource languages. In the scenario we focus on here is that there is a very limited amount of acoustic training data in the target language, but more plentiful data in other languages. In our approach, we investigate approaches based on Automatic Speech Attribute Transcription (ASAT) framework, and train universal classifiers using multilanguages to learn articulatory features. A hierarchical architecture is applied on both the articulatory feature and phone level, to make the neural network more discriminative. Finally we train the multilayer perceptrons using multi-streams from different languages and obtain MLPs for this low-resource application. In our experiments, we get significant improvements of about 12% relative versus a conventional baseline in this low-resource scenario.

In this paper we show how the robustness of multi-stream multi-layer perceptron (MLP) acoustic models can be increased through uncertainty propagation and decoding. We demonstrate that MLP uncertainty decoding yields consistent improvements over using minimum mean square error (MMSE) feature enhancement in MFCC and RASTA-LPCC domains. We introduce as well formulas for the computation of the uncertainty associated to the acoustic likelihood computation and explore different stream integration schemes using this uncertainty on the AURORA4 corpus.

This paper presents a new microphone-array post-filtering algorithm for distant speech recognition (DSR). Conventionally, post-filtering methods assume static noise field models, and using this assumption, employ a Wiener filter mechanism for estimating the noise parameters. In contrast to this, we show how we can build the Wiener post-filter based on actual noise observations without any noise-field assumption. The algorithm is framed within a state-of-the-art beamforming technique, namely maximum negentropy (MN) beamforming with super directivity. We investigate the effectiveness of the proposed post-filter on DSR through experiments on noisy data collected in a car under different acoustic conditions. Experiments show that the new post-filtering mechanism is able to achieve up to 20% relative reduction of word error rates (WER) under the represented noise conditions, as compared to a single distant microphone. In contrast, super-directive (SD) beamforming followed by Zelinski post-filtering achieves a relative WER reduction of only up to 11%. Other post-filters evaluated perform similarly in comparison to the proposed post-filter.

We address the speaker independent automatic recognition of spontaneous speech in highly instationary noise by applying semi-supervised sparse non-negative matrix factorization (NMF) for speech enhancement coupled with our recently proposed front end utilizing bottleneck (BN) features generated by a bidirectional Long Short-Term Memory (BLSTM) recurrent neural network. In our evaluation, we unite the noise corpus and evaluation protocol of the 2011 PASCAL CHiME challenge with the Buckeye database, and we demonstrate that the combination of NMF enhancement and BNBLSTM front end introduces significant and consistent gains in word accuracy in this highly challenging task at signal-to-noise ratios from -6 to 9 dB.

Joint uncertainty decoding (JUD) is an effective model-based noise compensation technique for conventional Gaussian mixture model (GMM) based speech recognition systems. In this paper, we apply JUD to subspace Gaussian mixture model (SGMM) based acoustic models. The total number of Gaussians in the SGMM acoustic model is usually much larger than for conventional GMMs, which limits the application of approaches which explicitly compensate each Gaussian, such as vector Taylor series (VTS). However, by clustering the Gaussian components into a number of regression classes, JUD-based noise compensation can be successfully applied to SGMM systems. We evaluate the JUD/SGMM technique using the Aurora 4 corpus, and the experimental results indicated that it is more accurate than conventional GMM-based systems using either VTS or JUD noise compensation.

On the AURORA-2 task good results at low SNR levels have been obtained with a system that uses state posterior estimates provided by an exemplar-based sparse classification (SC) system. At the same time, posterior estimates obtained with multilayer perceptron (MLP) yield good results at high SNRs. In this paper, we investigate the effect of combining the estimates from the SC and MLP systems at the probability level. More precisely, the probabilities are combined by a sum rule or a product rule using static and inverse-entropy based dynamic weights. In addition, we investigate a modified dynamic weighting approach which enhances the contribution of SC stream based on the information about static weights and average dynamic weights obtained on cross validation data. Our studies on AURORA-2 task shows that in all conditions the modified dynamic weighting approach yields a dual-input system that performs better than or equal to the best stand-alone system.

Log energy and its delta parameters, typically derived from full-band spectrum, are commonly used in automatic speech recognition (ASR) systems. In this paper, we address the problem of estimating log energy in the presence of background noise (usually resulting in a reduction in dynamic ranges of spectral energies). We theoretically show that the background noise affects the trajectories of the "conventional" log energy and its delta parameters, resulting in very poor estimation of the actual log energy and its delta parameters, which no longer describe the speech signal. We thus propose to estimate log energy from the sub-band spectrum, followed by a dynamic range stretching. Based on speech recognition experiments conducted on CENSREC-2 in-car database, the proposed log energy (and its corresponding delta parameters) is shown to perform very well, resulting in an average relative improvement of 27.2% compared with the baseline front-ends. Moreover, it is also shown that further improvement can be achieved by incorporating those new MFCCs obtained through non-linear spectral contrast stretching.

In this paper, we adress the problem of additive noise which degrades substantially the performances of speech recognition system. We propose a cepstral denoising based on the Subspace Gaussian Mixture Models paradigm (SGMM). The acoustic space is modeled by using a UBM-GMM. Each phoneme is modeled by a GMM derived from the UBM. The concatenation of the means of a given GMM leads to a very high dimention space, called the supervector space. The SGMM paradigm allows to model the additive noise as an additive component located in a subspace of low dimension (with respect to the supervector space). For each speech segment, this additive noise component is estimated in a model space. From this estimation, a specific frame transformation is obtained and applied to such a data frame. In this work, training data are assumed to be clean, so the cleaning process is applied only on test data. The proposed approach is tested on data recorded in a noisy environment and also on artificially noised data. With this approach we obtain, on data recorded in a noisy environment, a relative WER reduction of 15%.

In this paper, we propose a novel representation of the FMLLR transform. This is different from the standard FMLLR in that the linear transform (LT) is expressed in a factorized form such that each of the factors involves only one parameter. The representation is mainly motivated by QR-decomposition of a square matrix and hence is referred to as QR-FMLLR. The mathematical expressions and steps for maximum likelihood (ML) estimation of the parameters are presented. The ML estimation of QR-FMLLR does not require the use of numerical technique, such as gradient ascent, and it does not involve matrix inversion and computation of matrix determinant. On an LVCSR task, we show the performance of QR-FMLLR to be comparable to the standard FMLLR. We conjecture that QR-FMLLR is amenable to speaker adaptation using data that varies from very short to large and present a brief discussion on how this can be achieved.

A non-linear discriminant analysis based approach to feature space dimensionality reduction in noise robust automatic speech recognition (ASR) is proposed. It utilizes a correlation based distance measure instead of the conventional Euclidean distance. The use of this "correlation preserving discriminant analysis" (CPDA) procedure is motivated by evidence suggesting that correlation based cepstrum distance measures can be more robust than Euclidean based distances when speech is corrupted by noise. The performance of CPDA is evaluated in terms of the word error rate obtained by using CPDA derived features on the Aurora 2 speech in noise corpus, and is compared to the commonly used linear discriminant analysis (LDA) approach to feature space transformations.

In this work, we investigate the feasibility of applying our prior works on discriminative training (DT) using non-uniform criteria to a keyword spotting task on spontaneous conversational speech. One of DT methods, minimum classification error (MCE), is recast and efficiently implemented in the weighted finite state transducer (WFST) framework to fit a keyword spotting task. To validate our approach, we evaluate it on a conversational speech task, the credit card use subset of Switchboard, in both kinds of keyword spotting scenarios: one is when a large vocabulary continuous speech recognition (LVCSR) decoder is available, the other is when a simple word-loop grammar of limited vocabulary is used. The results show our approach performs well in both cases, achieving 2.77% and 3.15% figure of merits (FOMs) absolute improvements