INTERSPEECH.2008 - Speech Recognition

Total: 188

#1 Soft margin estimation with various separation levels for LVCSR [PDF] [Copy] [Kimi1] [REL]

Authors: Jinyu Li, Zhi-Jie Yan, Chin-Hui Lee, Ren-Hua Wang

We extend our previous work on soft margin estimation (SME) to large vocabulary continuous speech recognition (LVCSR) in two new aspects. The first is to formulate SME with different units of separation: SME methods focusing on string-, word-, and phone-level separation are defined. The second is to compare SME with the popular conventional discriminative training (DT) methods, including maximum mutual information estimation (MMIE), minimum classification error (MCE), and minimum word/phone error (MWE/MPE). Tested on the 5k-word Wall Street Journal task, all the SME methods achieve relative word error rate (WER) reductions of 17% to 25% over our baseline. Among them, phone-level SME obtains the best performance: slightly better than MPE, and much better than the other conventional DT methods. With this comprehensive comparison against conventional DT methods, SME demonstrates its effectiveness on LVCSR tasks.


#2 On the equivalence of Gaussian and log-linear HMMs [PDF] [Copy] [Kimi1] [REL]

Authors: Georg Heigold, Patrick Lehnen, Ralf Schlüter, Hermann Ney

The acoustic models of conventional state-of-the-art speech recognition systems use generative Gaussian HMMs. In the past few years, discriminative models such as Conditional Random Fields (CRFs) have been proposed to refine the acoustic models. CRFs directly model the class posteriors, the quantities of interest in recognition; they are undirected models and do not assume the local normalization constraints of HMMs. This paper addresses the question of to what extent such less restricted models add flexibility compared with their generative counterparts. This work extends our previous work in that it provides the technical details used for showing the equivalence of Gaussian and log-linear HMMs. The correctness of the proposed equivalence transformation for conditional probabilities is demonstrated on a simple concept tagging task.
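
The core of the equivalence is worth sketching (a standard argument, not the paper's full construction): a Gaussian log-density is linear in the sufficient statistics of the observation, so each state-conditional score already has log-linear form with features f(x) = (x, vec(xx^T)):

```latex
\log \mathcal{N}(x;\,\mu_s,\Sigma_s)
  \;=\; \bigl(\Sigma_s^{-1}\mu_s\bigr)^{\top} x
  \;-\; \tfrac{1}{2}\operatorname{vec}\bigl(\Sigma_s^{-1}\bigr)^{\top}
        \operatorname{vec}\bigl(x x^{\top}\bigr)
  \;+\; \mathrm{const}_s
  \;=\; \lambda_s^{\top} f(x) + \mathrm{const}_s
```

The non-trivial part, which the paper supplies, is absorbing the state-dependent constants and normalization terms so that the conditional probabilities of the two models match exactly.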


#3 Generalization of extended Baum-Welch parameter estimation for discriminative training and decoding [PDF] [Copy] [Kimi1] [REL]

Authors: Dimitri Kanevsky, Tara N. Sainath, Bhuvana Ramabhadran, David Nahamoo

We demonstrate the generalizability of the Extended Baum-Welch (EBW) algorithm not only for HMM parameter estimation but for decoding as well. We show that there can exist a general function associated with the objective function under EBW that reduces to the well-known auxiliary function used in the Baum-Welch algorithm for maximum likelihood estimation. We generalize the representation of the model parameter updates by making use of a differentiable function (such as an arithmetic or geometric mean) of the updated and current model parameters, and describe its effect on the learning rate during HMM parameter estimation. Improvements on speech recognition tasks are also presented.
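
For reference, the standard EBW mean update that such generalizations build on is the following (a textbook form, not quoted from the paper), where the occupation counts gamma and weighted observation sums theta(X) come from the numerator and denominator lattices, and D_jm is the per-Gaussian smoothing constant controlling the learning rate:

```latex
\hat{\mu}_{jm} \;=\;
  \frac{\theta^{\mathrm{num}}_{jm}(X) - \theta^{\mathrm{den}}_{jm}(X)
        + D_{jm}\,\mu_{jm}}
       {\gamma^{\mathrm{num}}_{jm} - \gamma^{\mathrm{den}}_{jm} + D_{jm}}
```

The paper's generalization replaces the implicit arithmetic combination of updated and current parameters with an arbitrary differentiable mean, which modulates this learning rate.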


#4 An ellipsoid constrained quadratic programming perspective to discriminative training of HMMs [PDF] [Copy] [Kimi1] [REL]

Authors: Peng Liu, Frank K. Soong

In this paper, we reformulate the optimization in discriminative training (DT) of HMMs as an ellipsoid constrained quadratic programming (ECQP) problem, in which a second-order approximation of the nonlinear objective space is adopted. We show that the unique optimal solution of ECQP can be obtained by an efficient line search, and no relaxation is needed as in general quadratically constrained quadratic programming (QCQP). Moreover, a subspace combination condition is introduced to further simplify it in certain cases. The concrete ECQP form of DT of HMMs is given based on a locality constraint and reasonable assumptions, and the algorithm can update Gaussians jointly or separately, in either sequential or batch mode. From the ECQP perspective, relationships between various popular DT optimization algorithms are discussed. Experimental results on two recognition tasks show that ECQP considerably outperforms other popular algorithms in terms of final recognition accuracy and convergence speed in iterations.


#5 Discriminative training of variable-parameter HMMs for noise robust speech recognition [PDF] [Copy] [Kimi1] [REL]

Authors: Dong Yu, Li Deng, Yifan Gong, Alex Acero

We propose a new type of variable-parameter hidden Markov model (VPHMM) whose mean and variance parameters each vary as a continuous function of additional environment-dependent parameters. Different from the polynomial-function-based VPHMM proposed by Cui and Gong (2007), the new VPHMM uses cubic splines to represent the dependency of the means and variances of Gaussian mixtures on the environment parameters. Importantly, the new model no longer requires quantization in estimating the model parameters, and it directly supports parameter sharing and instantaneous conditioning parameters. We develop and describe a growth-transformation algorithm that discriminatively learns the parameters of our cubic-spline-based VPHMM (CS-VPHMM), and evaluate the model on the Aurora-3 corpus with our recently developed MFCC-MMSE noise suppressor applied. Our experiments show that the proposed CS-VPHMM outperforms discriminatively trained and maximum-likelihood trained conventional HMMs with relative word error rate (WER) reductions of 14% and 20%, respectively, under well-matched conditions when both means and variances are updated.
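
A minimal sketch of the spline dependency described above: each Gaussian mean (and, analogously, each variance) becomes a cubic-spline function of an environment parameter such as the frame SNR. Knot positions and values here are illustrative placeholders, not the paper's:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical knots on the environment axis (SNR in dB) and the mean
# vector of one Gaussian at each knot; in CS-VPHMM these are the trained
# spline parameters.
snr_knots = np.array([0.0, 5.0, 10.0, 15.0, 20.0])
mean_knots = np.array([[1.2, -0.3], [1.0, -0.1], [0.8, 0.0],
                       [0.7, 0.1], [0.6, 0.1]])

mean_spline = CubicSpline(snr_knots, mean_knots, axis=0)

def gaussian_mean(snr_db: float) -> np.ndarray:
    """Instantiate the Gaussian mean for the current environment."""
    return mean_spline(np.clip(snr_db, snr_knots[0], snr_knots[-1]))

print(gaussian_mean(12.5))   # the mean varies continuously with SNR
```

This illustrates the instantaneous conditioning the abstract mentions: parameters are evaluated at each frame's environment value rather than at quantized environment classes.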


#6 Towards a non-parametric acoustic model: an acoustic decision tree for observation probability calculation [PDF] [Copy] [Kimi1] [REL]

Authors: Jasha Droppo, Michael L. Seltzer, Alex Acero, Yu-Hsiang Bosco Chiu

Modern automatic speech recognition systems use Gaussian mixture models (GMMs) on acoustic observations to model the probability of producing a given observation under any one of many hidden discrete phonetic states. This paper investigates the feasibility of using an acoustic decision tree to model these probabilities directly. Unlike the more common phonetic decision tree, which asks questions about phonetic context, an acoustic decision tree asks questions about the vector-valued observations. Three different types of acoustic questions are proposed and evaluated, including LDA, PCA, and MMI questions. Frame classification experiments are run on a subset of the Switchboard corpus. In these experiments, the acoustic decision tree produces slightly better results than maximum-likelihood trained GMMs, with significantly less computation. Some theoretical advantages of the acoustic decision tree are discussed, including more economical use of the training data and reduced mismatch between the acoustic model and the true probability distribution of the phonetic labels.


#7 A shrinkage estimator for speech recognition with full covariance HMMs [PDF] [Copy] [Kimi1] [REL]

Authors: Peter Bell, Simon King

We consider the problem of parameter estimation in full-covariance Gaussian mixture systems for automatic speech recognition. Due to the high dimensionality of the acoustic feature vector, the standard sample covariance matrix has high variance and is often poorly conditioned when the amount of training data is limited. We explain how the use of a shrinkage estimator can solve these problems, and derive a formula for the optimal shrinkage intensity. We present results of experiments on a phone recognition task, showing that the estimator gives a performance improvement over a standard full-covariance system.
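
A minimal sketch of the estimator family in question, assuming a Ledoit-Wolf-style convex combination of the sample covariance with a diagonal target; the paper derives the optimal intensity analytically, whereas `lam` is left free here:

```python
import numpy as np

def shrinkage_covariance(X: np.ndarray, lam: float) -> np.ndarray:
    """X: (n_frames, dim) frames assigned to one Gaussian; 0 <= lam <= 1."""
    S = np.cov(X, rowvar=False)        # sample covariance: high variance
    T = np.diag(np.diag(S))            # diagonal shrinkage target
    return lam * T + (1.0 - lam) * S   # shrunk, better-conditioned estimate

X = np.random.randn(50, 39)            # few frames relative to 39 dimensions
Sigma = shrinkage_covariance(X, lam=0.3)
print(np.linalg.cond(Sigma))           # condition number improves vs. plain S
```

Blending toward the diagonal target trades a little bias for a large variance reduction, which is precisely what helps when the training data per Gaussian is scarce.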


#8 Covariance updates for discriminative training by constrained line search [PDF] [Copy] [Kimi1] [REL]

Authors: Peter Bell, Simon King

We investigate the recently proposed Constrained Line Search algorithm for discriminative training of HMMs and propose an alternative formula for the variance update. We compare the method to standard techniques on a phone recognition task.


#9 Min-max discriminative training of decoding parameters using iterative linear programming [PDF] [Copy] [Kimi1] [REL]

Authors: Brian Mak, Tom Ko

In automatic speech recognition, the decoding parameters (the grammar factor and the word insertion penalty) are usually hand-tuned to give the best recognition performance. This paper investigates an automatic procedure to determine their values using an iterative linear programming (LP) algorithm. LP naturally implements discriminative training by mapping linear discriminants into LP constraints. A min-max cost function is also defined to obtain more stable and robust results. Empirical evaluations on the RM1 and WSJ0 speech recognition tasks show that decoding parameters found by the proposed algorithm are as good as those found by a brute-force grid search; their optimal values also seem to be independent of the initial values used to start the iterative LP algorithm.
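
A minimal sketch of one LP iteration under the min-max criterion, using scipy.optimize.linprog; all scores and counts below are fabricated placeholders standing in for statistics collected from N-best lists at each iteration:

```python
import numpy as np
from scipy.optimize import linprog

# Per (correct, competitor) hypothesis pair: differences (correct minus
# competitor) in acoustic log-score, LM log-score, and word count.
d_acoustic = np.array([-2.0, 1.5, -0.5])
d_lm       = np.array([ 0.8, -0.2, 0.3])
d_words    = np.array([ 1.0,  0.0, -2.0])

# Variables x = [g, p, t] (grammar factor, insertion penalty, worst-case
# violation). We require d_acoustic + g*d_lm + p*d_words >= -t for every
# pair, i.e. -g*d_lm - p*d_words - t <= d_acoustic, and minimize t.
A_ub = np.column_stack([-d_lm, -d_words, -np.ones_like(d_lm)])
b_ub = d_acoustic
c = np.array([0.0, 0.0, 1.0])          # min-max: minimize only t

res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0.0, 50.0), (-20.0, 20.0), (0.0, None)])
g, p, t = res.x
print(f"grammar factor={g:.2f}  insertion penalty={p:.2f}  violation={t:.2f}")
```

Each iteration re-decodes with the new (g, p) to regenerate competitors and then re-solves; the bounds on g and p here are arbitrary safeguards, not part of the published formulation.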


#10 Discriminative training for complementariness in system combination [PDF] [Copy] [Kimi1] [REL]

Authors: Daniel Willett, Chuang He

In recent years, techniques for combining the outputs of multiple speech recognizers for improved overall performance have gained popularity. Most commonly, the combined systems are built independently. This paper describes our attempt to directly target joint system performance in the discriminative training objective of acoustic model parameter estimation. It also reports first promising results.


#11 Penalty function maximization for large margin HMM training [PDF] [Copy] [Kimi1] [REL]

Authors: George Saon, Daniel Povey

We perform large margin training of HMM acoustic parameters by maximizing a penalty function which combines two terms. The first term is a scale which is multiplied with the Hamming distance between HMM state sequences to form a multi-label (or sequence) margin. The second term arises from constraints on the training data requiring that the joint log-likelihood of the acoustics and the correct word sequence exceed the joint log-likelihood of the acoustics and any incorrect word sequence by at least the multi-label margin between the corresponding Viterbi state sequences. Using the soft-max trick, we collapse these constraints into a boosted MMI-like term. The resulting objective function can be efficiently maximized using extended Baum-Welch updates. Experimental results on multiple LVCSR tasks show a good correlation between the objective function and the word error rate.
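
The abstract does not spell out the collapsed objective; a plausible form, following the boosted-MMI formulation it alludes to (b is the learned margin scale, H the Hamming distance between Viterbi state sequences), is:

```latex
\mathcal{F}(\lambda) \;=\; \sum_r \log
  \frac{p_\lambda(X_r \mid W_r)\, P(W_r)}
       {\sum_{W} p_\lambda(X_r \mid W)\, P(W)\,
        e^{-b\, H(s_W,\, s_{W_r})}}
```

The exponential weighting boosts the contribution of competitors whose state sequences are far from the correct one, which is how the margin constraints survive the soft-max collapse.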


#12 Implicit state-tying for support vector machines based speech recognition [PDF] [Copy] [Kimi1] [REL]

Authors: Daniel Bolaños, Wayne Ward

In this article we take a step towards the application of Support Vector Machines (SVMs) to continuous speech recognition. As in previous work, we use SVMs to estimate emission probabilities in the context of an SVM/HMM system. However, training pairwise classifiers to discriminate between some of the HMM states of very close phonetic classes produces unsatisfactory results. We propose a data-driven approach for selecting the HMM states for which SVMs are trained and those that are implicitly tied.


#13 Using KL-based acoustic models in a large vocabulary recognition task [PDF] [Copy] [Kimi1] [REL]

Authors: Guillermo Aradilla, Hervé Bourlard, Mathew Magimai Doss

Posterior probabilities of sub-word units have been shown to be an effective front-end for ASR. However, attempts to model this type of feature either do not benefit from modeling context-dependent phonemes, or use an inefficient distribution to estimate the state likelihood. This paper presents a novel acoustic model for posterior features that overcomes these limitations. The proposed model can be seen as an HMM where the score associated with each state is the KL divergence between a distribution characterizing the state and the posterior features from the test utterance. This KL-based acoustic model establishes a framework in which other models for posterior features, such as hybrid HMM/MLP and discrete HMM, can be seen as particular cases. Experiments on the WSJ database show that the KL-based acoustic model can significantly outperform these latter approaches. Moreover, the proposed model can obtain results comparable to more complex systems, such as HMM/GMM, using significantly fewer parameters.
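
A minimal sketch of the state score, assuming each state is characterized by a multinomial over sub-word units and each frame by an MLP posterior vector; the direction of the divergence shown is one possible choice, and all values are made up:

```python
import numpy as np

def kl_state_score(state_dist: np.ndarray, frame_posterior: np.ndarray,
                   eps: float = 1e-10) -> float:
    """KL(state || frame posterior), used in place of a log-likelihood."""
    p = state_dist + eps               # distribution characterizing the state
    q = frame_posterior + eps          # posterior features for one frame
    return float(np.sum(p * np.log(p / q)))

state = np.array([0.7, 0.2, 0.1])
frame = np.array([0.6, 0.3, 0.1])
print(kl_state_score(state, frame))    # smaller divergence = better match
```

Decoding then proceeds as in a standard HMM, with the negated divergence playing the role of the state emission score.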


#14 Acoustic modeling based on model structure annealing for speech recognition [PDF] [Copy] [Kimi1] [REL]

Authors: Sayaka Shiota, Kei Hashimoto, Heiga Zen, Yoshihiko Nankaku, Akinobu Lee, Keiichi Tokuda

This paper proposes an HMM training technique using multiple phonetic decision trees and evaluates it in speech recognition. When context-dependent models are used, decision-tree-based context clustering is applied to find a parameter-tying structure. However, the clustering is usually performed based on statistics of HMM state sequences obtained from unreliable models without context clustering. To avoid this problem, we optimize the decision trees and HMM state sequences simultaneously. In the proposed method, this is done by maximum likelihood (ML) estimation of a newly defined statistical model that includes multiple decision trees as hidden variables. By applying the deterministic annealing expectation maximization (DAEM) algorithm and using multiple decision trees in the early stages of model training, state sequences are reliably estimated. In continuous phoneme recognition experiments, the proposed method improves recognition performance.
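
For context, the DAEM E-step the method relies on replaces the usual posterior over hidden variables z (here, the decision trees and state sequences) with a tempered version, raising the temperature parameter beta from near 0 toward 1 as training proceeds:

```latex
\tilde{p}_{\beta}(z \mid x, \theta) \;=\;
  \frac{p(x, z \mid \theta)^{\beta}}
       {\sum_{z'} p(x, z' \mid \theta)^{\beta}}
```

At small beta the posterior is nearly uniform, so early training commits to no single tree or alignment, which is what makes the simultaneous optimization tractable.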


#15 Bayesian context clustering using cross valid prior distribution for HMM-based speech recognition [PDF] [Copy] [Kimi1] [REL]

Authors: Kei Hashimoto, Heiga Zen, Yoshihiko Nankaku, Akinobu Lee, Keiichi Tokuda

This paper proposes a prior distribution determination technique using cross validation for speech recognition based on the Bayesian approach. The Bayesian method is a statistical technique for estimating reliable predictive distributions by marginalizing over model parameters; its approximate version, the variational Bayesian method, has been applied to HMM-based speech recognition. Since prior distributions representing prior information about model parameters affect the posterior distributions and model selection, the determination of prior distributions is an important problem. However, it has not been thoroughly investigated in speech recognition. The proposed method can determine reliable prior distributions without tuning parameters and select an appropriate model structure depending on the amount of training data. Continuous phoneme recognition experiments show that the proposed method achieves higher performance than the conventional methods.


#16 Speech recognition using soft decision trees [PDF] [Copy] [Kimi1] [REL]

Authors: Jitendra Ajmera, Masami Akamine

This paper presents recent developments at our site toward speech recognition using decision-tree-based acoustic models. Previously, robust decision trees have been shown to achieve better performance than standard Gaussian mixture model (GMM) acoustic models. This was achieved by converting the hard questions (decisions) of a standard tree into soft questions using a sigmoid function. In this paper, we report our work on training soft-decision trees from scratch. These soft-decision trees are shown to yield better speech recognition accuracy than standard GMM acoustic models on the Aurora digit recognition task.
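
A minimal sketch of the soft-question idea: the hard split [x_d > theta] of a standard tree becomes a sigmoid gate, so each frame descends both branches with complementary weights. The tree shape and parameters below are illustrative only:

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

def soft_tree_score(x: np.ndarray, node: dict) -> float:
    """Blend leaf scores by the soft-question probabilities along each path."""
    if "leaf" in node:
        return node["leaf"]                       # leaf emission score
    d, theta, a = node["dim"], node["theta"], node["steepness"]
    p_right = sigmoid(a * (x[d] - theta))         # soft version of x[d] > theta
    return (p_right * soft_tree_score(x, node["right"])
            + (1.0 - p_right) * soft_tree_score(x, node["left"]))

tree = {"dim": 0, "theta": 0.5, "steepness": 4.0,
        "left": {"leaf": 0.2}, "right": {"leaf": 0.9}}
print(soft_tree_score(np.array([1.3, -0.7]), tree))
```

Because the sigmoid is differentiable, all node parameters can be trained jointly by gradient methods, which is what training from scratch requires.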


#17 GPU-accelerated Gaussian clustering for fMPE discriminative training [PDF] [Copy] [Kimi1] [REL]

Authors: Yu Shi, Frank Seide, Frank K. Soong

The Graphics Processing Unit (GPU) has extended its applications from its original graphics rendering to more general scientific computation. Through massive parallelization, state-of-the-art GPUs can deliver 200 billion floating-point operations per second (0.2 TFLOPS) on a single consumer-priced graphics card. This paper describes our attempt to leverage GPUs for efficient HMM model training. We show that using GPUs for a specific example, Gaussian clustering as required in fMPE (feature-domain Minimum Phone Error) discriminative training, can be highly desirable. The clustering of the huge number of Gaussians is very time-consuming due to the enormous model size of current LVCSR systems. Comparing an NVidia GeForce 8800 Ultra GPU against an Intel Pentium 4 implementation, we find that our brute-force GPU implementation is 14 times faster overall than a CPU implementation that uses approximate speed-up heuristics. GPU-accelerated fMPE reduces the WER by 6% relative to the maximum-likelihood trained baseline on two conversational-speech recognition tasks.


#18 Discriminative training using the trusted expectation maximization [PDF] [Copy] [Kimi1] [REL]

Authors: Yasser Hifny, Yuqing Gao

We present Trusted Expectation-Maximization (TEM), a new discriminative training scheme for speech recognition applications. In particular, the TEM algorithm may be used for discriminative training of Hidden Markov Models (HMMs). The TEM algorithm has a form similar to the Expectation-Maximization (EM) algorithm, an efficient iterative procedure for maximum likelihood estimation in the presence of hidden variables [1]. The TEM algorithm has been empirically shown to increase a rational objective function. In the concave regions of a rational function, it can be shown that the maximization steps of the TEM algorithm and the hypothesized EM algorithm are identical. On the TIMIT phone recognition task, preliminary experimental results show competitive optimization performance over conventional discriminative training approaches (in terms of speed and accuracy).


#19 Maximum mutual information estimation with unlabeled data for phonetic classification [PDF] [Copy] [Kimi1] [REL]

Authors: Jui-Ting Huang, Mark Hasegawa-Johnson

This paper proposes a new training framework for mixed labeled and unlabeled data and evaluates it on the task of binary phonetic classification. Our training objective function combines Maximum Mutual Information (MMI) for labeled data with Maximum Likelihood (ML) for unlabeled data. Through the modified training objective, MMI estimates are smoothed with ML estimates obtained from unlabeled data. Our training criterion can also help an existing model adapt to new speech characteristics from unlabeled speech. In our phonetic classification experiments, there is a consistent reduction in error rate from MLE to MMIE with I-smoothing, and then to MMIE with unlabeled-smoothing; error rates can be further reduced by transductive MMIE. We also experimented with the gender-mismatched case, in which MMIE with unlabeled data achieves, at best, an absolute error rate 9.3% lower than MLE and 2.35% lower than MMIE with I-smoothing.
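
A natural form of such a combined criterion, with a weight alpha on the unlabeled ML term (the abstract does not specify the exact weighting), over a labeled set L and an unlabeled set U:

```latex
\mathcal{F}(\theta) \;=\;
  \sum_{i \in \mathcal{L}} \log
    \frac{p_\theta(x_i \mid y_i)\,P(y_i)}
         {\sum_{y} p_\theta(x_i \mid y)\,P(y)}
  \;+\; \alpha \sum_{j \in \mathcal{U}} \log
    \sum_{y} p_\theta(x_j \mid y)\,P(y)
```

The first term is standard MMI on labeled tokens; the second is the marginal likelihood of unlabeled tokens, which acts as the smoothing described above.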


#20 Maximum accept and reject (MARS) training of HMM-GMM speech recognition systems [PDF] [Copy] [Kimi1] [REL]

Author: Vivek Tyagi

This paper describes a new discriminative HMM parameter estimation technique. It supplements the usual ML objective function with the emission (accept) likelihood of the aligned state (phone) and the rejection likelihoods of the remaining states (phones). Intuitively, this new objective function takes into account how well the other states reject the current frame that has been aligned with a given state. This simple scheme, termed Maximum Accept and Reject (MARS), implicitly brings in discriminative information and hence performs better than ML-trained models. As is well known, maximum mutual information (MMI) training [3, 4] needs a language model (lattice) encoding all possible sentences that could occur in the test conditions [7, 9]. MMI training uses this language model (lattice) to identify the confusable segments of speech in the form of the so-called "denominator" state occupation statistics [7]. However, this implicitly ties the MMI-trained acoustic model to a particular task domain. MARS training does not face this constraint, as it finds the confusable states at the frame level and hence does not use a language model (lattice) during training.
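
A minimal per-frame sketch of the MARS idea: the accept log-likelihood of the aligned state is supplemented with rejection terms from the competing states. The balance weight alpha and the averaging over competitors are hypothetical choices, not taken from the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mars_frame_objective(x: np.ndarray, states: list, aligned: int,
                         alpha: float = 1.0) -> float:
    """states: list of (mean, cov) per state; aligned: index of aligned state."""
    loglikes = np.array([multivariate_normal.logpdf(x, mean, cov)
                         for mean, cov in states])
    accept = loglikes[aligned]                      # usual ML (accept) term
    reject = -np.delete(loglikes, aligned).mean()   # reward rejecting the frame
    return accept + alpha * reject

states = [(np.zeros(2), np.eye(2)), (np.ones(2), np.eye(2)),
          (-np.ones(2), np.eye(2))]
print(mars_frame_objective(np.array([0.1, -0.2]), states, aligned=0))
```

Since the competitors are enumerated over states at the frame level, no lattice or language model enters the computation, matching the task-independence argument above.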


#21 Nonlinear mixture autoregressive hidden Markov models for speech recognition [PDF] [Copy] [Kimi1] [REL]

Authors: Sundar Srinivasan, Tao Ma, Daniel May, Georgios Lazarou, Joseph Picone

Gaussian mixture models are a very successful method for modeling the output distribution of a state in a hidden Markov model (HMM). However, this approach is limited by the assumption that the dynamics of speech features are linear and can be modeled with static features and their derivatives. In this paper, a nonlinear mixture autoregressive model is used to model state output distributions (MAR-HMM). Estimation of model parameters is extended to handle vector features. MAR-HMMs are shown to provide superior performance to comparable Gaussian mixture model-based HMMs (GMM-HMM) with lower complexity on two pilot classification tasks.
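
A generic mixture autoregressive state-output density of the kind described, with order P and mixture weights w_k (the paper's exact nonlinear parameterization may differ):

```latex
p(\mathbf{x}_t \mid s) \;=\; \sum_{k=1}^{K} w_k\,
  \mathcal{N}\!\Big(\mathbf{x}_t;\;
    \mathbf{c}_k + \sum_{j=1}^{P} \mathbf{A}_{k,j}\,\mathbf{x}_{t-j},\;
    \boldsymbol{\Sigma}_k\Big)
```

Each component predicts the current frame from previous frames directly, so temporal dynamics are modeled inside the state rather than approximated by appending delta features.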


#22 GPU accelerated acoustic likelihood computations [PDF] [Copy] [Kimi1] [REL]

Authors: Patrick Cardinal, Pierre Dumouchel, Gilles Boulianne, Michel Comeau

This paper introduces the use of the Graphics Processing Unit (GPU) for computing acoustic likelihoods in a speech recognition system. In addition to their wide availability, GPUs provide high computing performance at low cost. We have used an NVidia GeForce 8800GTX programmed with CUDA (Compute Unified Device Architecture), which exposes the GPU as a parallel coprocessor. The acoustic likelihoods are computed as dot products, an operation for which GPUs are highly efficient. The implementation in our speech recognition system shows that the GPU is 5x faster than the CPU SSE-based implementation. This improvement led to a speed-up of 35% on a large vocabulary task.
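
The dot-product formulation is the standard trick for diagonal-covariance Gaussians: each Gaussian becomes a weight vector, each frame an augmented vector [1, x, x^2], and all scores fall out of one matrix multiply, which is the workload the GPU executes. A minimal numpy stand-in for the CUDA kernel:

```python
import numpy as np

def gaussians_to_weights(means: np.ndarray, variances: np.ndarray) -> np.ndarray:
    """means, variances: (n_gauss, dim) -> weight matrix (n_gauss, 2*dim+1)."""
    const = -0.5 * np.sum(np.log(2 * np.pi * variances)
                          + means**2 / variances, axis=1, keepdims=True)
    return np.hstack([const, means / variances, -0.5 / variances])

def augment_frames(X: np.ndarray) -> np.ndarray:
    """X: (n_frames, dim) -> (n_frames, 2*dim+1) with [1, x, x^2] per frame."""
    return np.hstack([np.ones((X.shape[0], 1)), X, X**2])

means, variances = np.random.randn(4, 3), np.ones((4, 3))
X = np.random.randn(5, 3)
scores = augment_frames(X) @ gaussians_to_weights(means, variances).T
print(scores.shape)                    # (n_frames, n_gauss) log-likelihoods
```

One large matrix multiply over all frames and Gaussians is exactly the memory-coherent, massively parallel workload GPUs are built for.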


#23 Nonnative speech recognition based on state-candidate bilingual model modification [PDF] [Copy] [Kimi1] [REL]

Authors: Qingqing Zhang, Ta Li, Jielin Pan, Yonghong Yan

Speech recognition accuracy has been observed to decrease for nonnative speakers, especially those who are just beginning to learn a foreign language or who have heavy accents. This paper presents a novel bilingual model modification approach to improve nonnative speech recognition, accounting for the large variations in accented pronunciation. Each state of the baseline nonnative acoustic models is modified with several candidate states from the auxiliary acoustic models, which are trained on the speakers' mother tongue. A state mapping criterion and n-best candidate selection are investigated based on a grammar-constrained speech recognition system. Using the state-candidate bilingual model modification approach, a further relative reduction of 7.87% in phrase error rate was achieved over nonnative acoustic models that had already been well trained with MAP adaptation.


#24 Prosodic and spectral features within segment-based acoustic modeling [PDF] [Copy] [Kimi1] [REL]

Authors: Björn Schuller, Xiaohua Zhang, Gerhard Rigoll

Besides the usually employed MFCC, PLP, and energy features, duration, low-order formants, pitch, and center-of-gravity-based features are also known to carry valuable information for phoneme recognition. This work investigates their individual performance within segment-based acoustic modeling. Experiments optimizing a feature space spanned exclusively by this set are also reported, using CFSS feature-space optimization and speaker adaptation. All tests are carried out with SVMs on the open IFA corpus of 47 Dutch hand-labeled phonemes with a total of 178k instances. Extensive speaker-dependent vs. speaker-independent test runs are discussed, as well as four speaking styles ranging from informal to formal: informal and retold storytelling, and reading aloud with fixed and variable content. Results show the potential of these rather uncommon features, e.g. those based on F3 or pitch.


#25 Unsupervised versus supervised training of acoustic models [PDF] [Copy] [Kimi1] [REL]

Authors: Jeff Ma, Richard Schwartz

In this paper we report unsupervised training experiments conducted on large amounts of English Fisher conversational telephone speech. A great deal of work has been reported on unsupervised training, but this work differs in that we compare the behavior of unsupervised and supervised training on exactly the same data. This comparison reveals surprising results. First, as the amount of training data increases, unsupervised training, even when bootstrapped with a very limited amount (1 hour) of manually transcribed data, improves recognition performance faster than supervised training does, and converges to supervised-training performance. Second, bootstrapping unsupervised training with more manually transcribed data matters little if a large amount of untranscribed data is available.