Phonetic decision trees are a key concept in acoustic modeling for large vocabulary continuous speech recognition. Although discriminative training has become a major line of research in speech recognition and all state-of-the-art acoustic models are trained discriminatively, the conventional phonetic decision tree approach still relies on the maximum likelihood principle. In this paper we develop a splitting criterion based on the minimization of the classification error. An improvement of more than 10% relative over a discriminatively trained baseline system on the Wall Street Journal corpus suggests that the proposed approach is promising.
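As background for the baseline the paper improves on: in conventional decision tree state tying, each phonetic question is scored by the likelihood gain of the two-way split it induces, typically under a single Gaussian per node estimated from sufficient statistics. The sketch below illustrates that conventional ML criterion only (the function names and the diagonal-covariance assumption are ours); the proposed method would replace this gain with an estimate of the reduction in classification error.

```python
import numpy as np

def gaussian_log_likelihood(n, sum_x, sum_xx):
    """Log-likelihood of n frames under a single diagonal Gaussian whose ML
    parameters are computed from the sufficient statistics (n, sum_x, sum_xx)."""
    mean = sum_x / n
    var = sum_xx / n - mean ** 2          # diagonal covariance
    d = len(mean)
    return -0.5 * n * (d * (1.0 + np.log(2.0 * np.pi)) + np.sum(np.log(var)))

def ml_split_gain(stats_yes, stats_no):
    """Likelihood gain of splitting a tree node into 'yes'/'no' children
    for one phonetic question; each stats tuple is (n, sum_x, sum_xx)."""
    n_y, sx_y, sxx_y = stats_yes
    n_n, sx_n, sxx_n = stats_no
    parent = (n_y + n_n, sx_y + sx_n, sxx_y + sxx_n)
    return (gaussian_log_likelihood(*stats_yes)
            + gaussian_log_likelihood(*stats_no)
            - gaussian_log_likelihood(*parent))
```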
Current speech recognition systems are often based on HMMs with state-clustered Gaussian Mixture Models (GMMs) to represent the context-dependent output distributions. Though highly successful, the standard form of the model does not exploit any relationships between the states; each state has its own separate parameters. This paper describes a general class of model in which the context-dependent state parameters are a transformed version of one, or more, canonical states. A number of published models sit within this framework, including semi-continuous HMMs, subspace GMMs and the HMM error model. A set of preliminary experiments illustrating some of this model's properties using CMLLR transformations from the canonical state to the context-dependent state is described.
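As one concrete instance of such a framework, when the transform from a canonical state to a context-dependent state s is a CMLLR transform (A_s, b_s), the state likelihood can be written, in our notation, as

\[ p(\mathbf{o}_t \mid s) \;=\; |\mathbf{A}_s| \sum_m c_m\, \mathcal{N}\!\big(\mathbf{A}_s \mathbf{o}_t + \mathbf{b}_s;\ \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m\big), \]

where (c_m, mu_m, Sigma_m) are the canonical-state GMM parameters shared by all context-dependent states tied to that canonical state; only the transform is context dependent.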
Variational KL (varKL) divergence minimization was previously applied to restructuring acoustic models (AMs) based on Gaussian mixture models, reducing their size while preserving their accuracy. In this paper, we derive a related varKL for exponential family mixture models (EMMs) and test its accuracy using the weighted local maximum likelihood agglomerative clustering technique. Minimizing varKL between a reference and a restructured AM previously led to the variational expectation maximization (varEM) algorithm, which we extend to EMMs. We present results on a clustering task using AMs trained on 50 hours of Broadcast News (BN). EMMs are trained on fMMI-PLP features combined with frame-level phone posterior probabilities given by the recently introduced sparse representation phone identification process. As we reduce model size, we test the word error rate using the standard BN test set and compare with baseline models of the same size trained directly from data.
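For reference, in the GMM case the variational approximation to the KL divergence between a reference mixture f = sum_a pi_a f_a and a restructured mixture g = sum_b omega_b g_b takes the form (in our notation)

\[ D_{\mathrm{var}}(f \,\|\, g) \;=\; \sum_a \pi_a \log \frac{\sum_{a'} \pi_{a'}\, e^{-D_{\mathrm{KL}}(f_a \| f_{a'})}}{\sum_b \omega_b\, e^{-D_{\mathrm{KL}}(f_a \| g_b)}}, \]

which requires only closed-form KL divergences between individual components; the extension to EMMs rests on the fact that these component-wise divergences are again available in closed form for exponential family densities.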
One of the difficult problems of acoustic modeling for Automatic Speech Recognition (ASR) is how to adequately model the wide variety of acoustic conditions which may be present in the data. The problem is especially acute for tasks such as Google Search by Voice, where the amount of speech available per transaction is small and adaptation techniques start showing their limitations. As training data from a very large user population is available, however, it is possible to identify and jointly model subsets of the data with similar acoustic qualities. We describe a technique which allows us to perform this modeling at scale on large amounts of data by learning a tree-structured partition of the acoustic space, and we demonstrate that we can significantly improve recognition accuracy in various conditions through unsupervised Maximum Mutual Information (MMI) training. Being fully unsupervised, this technique scales easily to increasing numbers of conditions.
Hidden Markov Models are widely used in speech recognition systems. Due to the co-articulation effects of continuous speech, context-dependent models have been found to yield performance improvements. One major issue with context-dependent acoustic modelling is the robust parameter estimation of models that are unseen or rare in the training data. Typically, decision tree state clustering is used to ensure that there are sufficient data for each physical state. Decision trees based on phonetic questions are used to cluster the states. In this paper, a conditional random field (CRF) is used to perform probabilistic state clustering, where phonetic questions are used as binary feature functions to predict the latent cluster weights. Experimental results on the Wall Street Journal corpus reveal that CRF-based state clustering outperformed the conventional maximum likelihood decision tree state clustering by about 10% relative at similar model complexities.
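To give a rough sense of the modeling idea (an illustrative caricature, not the paper's exact CRF formulation), a state's soft cluster assignment can be driven by a linear score over its binary phonetic-question answers:

```python
import numpy as np

def cluster_posteriors(question_answers, weights):
    """Soft assignment of a triphone state to clusters.

    question_answers: binary vector (one entry per phonetic question answered
                      for this state's context).
    weights:          matrix of shape (n_clusters, n_questions), learned
                      discriminatively.
    Returns p(cluster | state) as a softmax over the linear scores.
    """
    scores = weights @ question_answers
    scores -= scores.max()                 # numerical stability
    p = np.exp(scores)
    return p / p.sum()
```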
We propose a novel approach that integrates template matching with statistical modeling to improve continuous speech recognition. We use multiple Gaussian Mixture Model (GMM) indices to represent each frame of the speech templates, agglomerative clustering to generate template representatives, and the log likelihood ratio as the local distance measure for DTW template matching in lattice rescoring. Experimental results on the TIMIT phone recognition task demonstrate that the proposed approach consistently and significantly improved several HMM baselines: the absolute accuracy gain was 1.69% to 1.83% when all training templates were used, and 1.29% to 1.37% when template representatives were used.
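The lattice rescoring step relies on a standard dynamic time warping alignment in which the usual frame distance is replaced by the log-likelihood-ratio based measure; the sketch below shows the generic DTW recursion with the local distance left as a placeholder (the names are ours, not the paper's):

```python
import numpy as np

def dtw(template, utterance, local_distance):
    """Total DTW alignment cost between a template and a test segment.
    local_distance(a, b) stands in for the log-likelihood-ratio based
    distance between a template frame and a test frame."""
    T, U = len(template), len(utterance)
    D = np.full((T + 1, U + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, U + 1):
            cost = local_distance(template[i - 1], utterance[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T, U]
```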
We employ a variant of the popular AdaBoost algorithm to train multiple acoustic models such that the aggregate system exhibits improved performance over the individual recognizers. Each model is trained sequentially on re-weighted versions of the training data. At each iteration, the weights are decreased for the frames that are correctly decoded by the current system. These weights are then multiplied with the frame-level statistics for the decision trees and Gaussian mixture components of the next-iteration system. The composite system uses a log-linear combination of HMM state observation likelihoods. We report experimental results on several broadcast news transcription setups which differ in the language being spoken (English and Arabic) and in the amount of training data. Our findings suggest that significant gains can be obtained for small amounts of training data even after feature- and model-space discriminative training.
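A minimal sketch of the frame-reweighting step, assuming a simple multiplicative down-weighting of correctly decoded frames (the constant and function names are illustrative, not taken from the paper):

```python
import numpy as np

def update_frame_weights(weights, frame_correct, beta=0.5):
    """One boosting-style reweighting pass over training frames.

    weights:       current per-frame weights (positive array).
    frame_correct: boolean array, True where the current system decodes the
                   frame correctly.
    beta:          down-weighting factor for correct frames (illustrative).

    Correct frames are down-weighted so the next model focuses on the frames
    the current system gets wrong; the new weights would then scale the
    frame-level statistics used to grow decision trees and estimate GMMs.
    """
    new_weights = np.where(frame_correct, beta * weights, weights)
    return new_weights * (weights.sum() / new_weights.sum())   # keep total mass
```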
Sparse representation phone identification features (SPIF) are a recently developed technique for obtaining estimates of phone posterior probabilities conditioned on an acoustic feature vector. In this paper, we explore incorporating SPIF phone posterior probability estimates into a large vocabulary continuous speech recognition (LVCSR) task by including them as additional features of the exponential densities that model the HMM state emission likelihoods. We compare our proposed approach to a number of other well known methods of combining feature streams or multiple LVCSR systems. Our experiments show that using exponential models to combine features results in a word error rate reduction of 0.5% absolute (18.7% down to 18.2%); this is comparable to the best error rate reduction obtained from system combination methods, but without having to build multiple systems or tune the system combination weights.
In this paper, we propose to incorporate the widely used Multi-Layer Perceptron (MLP) features and discriminative training (DT) into our recent data-sampling based ensemble acoustic models, to further improve the quality of the individual models as well as the diversity among them. We also propose applying speaker-model-distance based speaker clustering for data sampling to construct ensembles of acoustic models for speaker-independent speech recognition. Using these methods on the speaker-independent TIMIT phone recognition task, we obtain a phoneme recognition accuracy of 77.1% on the TIMIT complete test set, an absolute improvement of 5.4% over our conventional HMM baseline system, making this one of the best reported results on the TIMIT continuous phoneme recognition task.
In this paper, we propose a new semi-supervised training method for Gaussian Mixture Models. We add a conditional entropy minimizer to the maximum mutual information criterion, which makes it possible to incorporate unlabeled data in a discriminative training fashion. The training method is simple but surprisingly effective. The preconditioned conjugate gradient method provides a reasonable convergence rate for the parameter updates. Phonetic classification experiments on the TIMIT corpus demonstrate significant improvements from unlabeled data via our training criterion.
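In our notation, a criterion of this kind combines an MMI-style term over the labeled set L with a conditional entropy term over the unlabeled set U, traded off by a weight lambda (the exact weighting used in the paper may differ):

\[ \mathcal{F}(\Lambda) \;=\; \sum_{i \in \mathcal{L}} \log p_\Lambda(y_i \mid x_i) \;-\; \lambda \sum_{j \in \mathcal{U}} H\big(p_\Lambda(\cdot \mid x_j)\big), \qquad H(p) = -\sum_y p(y) \log p(y). \]

Minimizing the conditional entropy pushes the model toward confident, consistent decisions on the unlabeled data.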
This paper presents an experimental study of a maximum likelihood (ML) approach to irrelevant variability normalization (IVN) based training and unsupervised online adaptation for large vocabulary continuous speech recognition. A moving-window based frame labeling method is used for acoustic sniffing. The IVN-based approach achieves a 10% relative word error rate reduction over an ML-trained baseline system on a Switchboard-1 conversational telephone speech transcription task.
Generalized Discriminative Feature Transformation (GDFT) is a feature space discriminative training algorithm for automatic speech recognition (ASR). GDFT uses Lagrange relaxation to transform the constrained maximum likelihood linear regression (CMLLR) algorithm for feature space discriminative training. This paper presents recent improvements to GDFT, achieved by adding regularization to the optimization problem. The resulting algorithm is called regularized GDFT (rGDFT), and we show that many regularization and smoothing techniques developed for model-space discriminative training are also applicable to feature-space training. We evaluated rGDFT on a real-time Iraqi ASR system and on a large scale Arabic ASR task.
In this paper we describe a parallel implementation of an ANN training procedure based on the block-mode back-propagation learning algorithm. Two different approaches to parallelizing the training were implemented. The first is data parallelization using POSIX threads, suitable for multi-core computers. The second is node parallelization using the high-performance SIMD architecture of a GPU with CUDA, suitable for CUDA-enabled computers. We compare the speedup of both approaches by training a typically sized network on a real-world phoneme-state classification task, showing a nearly 10-fold reduction in training time with the CUDA version, while the multi-threaded version on an 8-core server gives only a 4-fold reduction. In both cases the comparison is against an implementation already optimized with BLAS. The training tool will be released as open-source software under the project name TNet.
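Conceptually, block-mode back-propagation accumulates the gradient over a block of frames and applies a single synchronous weight update; data parallelization splits the block across workers. The sketch below shows the scheme only (chunks are processed sequentially here, whereas in the actual tool they would run on POSIX threads or CUDA kernels; the function names are ours):

```python
import numpy as np

def block_update(weights, block_x, block_y, grad_fn, n_workers=8, lr=0.01):
    """Conceptual data-parallel block-mode back-propagation update.

    The block of frames is split into n_workers chunks; grad_fn(weights, x, y)
    returns the gradient of the loss on one chunk. The chunk gradients are
    summed and a single synchronous update is applied per block.
    """
    xs = np.array_split(block_x, n_workers)
    ys = np.array_split(block_y, n_workers)
    grad = sum(grad_fn(weights, x, y) for x, y in zip(xs, ys))
    return weights - lr * grad
```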
In unsupervised training of ASR systems, no annotated data are assumed to exist. Word-level annotations for the training audio are generated iteratively using an ASR system. At each iteration, a subset of data judged as having the most reliable transcriptions is selected to train the next set of acoustic models. Data selection, however, remains a difficult problem, particularly when the error rate of the recognizer providing the initial annotation is very high. In this paper we propose an iterative algorithm that uses a combination of likelihoods and a simple model of sense to select data. We show that the algorithm is effective for unsupervised training of acoustic models, particularly when the initial annotation is highly erroneous. Experiments conducted on Fisher-1 data, using initial models from Switchboard and a vocabulary and LM derived from the Google N-grams, show that performance on a selected held-out test set from Fisher data improves when we use the proposed iterative approach.
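A minimal sketch of such an iterative select-and-retrain loop, with the decoding, scoring and training steps abstracted into callables (all names and the fixed selection fraction are our simplifications, not the paper's algorithm):

```python
def unsupervised_training(audio, seed_models, decode, score, train,
                          n_iters=5, keep=0.3):
    """Iterative unsupervised acoustic model training (conceptual sketch).

    decode(models, utt) -> hypothesis transcript for an utterance
    score(utt, hyp)     -> reliability score, e.g. a normalized likelihood
                           combined with a simple well-formedness measure
    train(pairs)        -> new acoustic models from (utterance, transcript) pairs
    """
    models = seed_models
    for _ in range(n_iters):
        hyps = [(utt, decode(models, utt)) for utt in audio]
        ranked = sorted(hyps, key=lambda p: score(*p), reverse=True)
        selected = ranked[: int(keep * len(ranked))]   # most reliable subset
        models = train(selected)
    return models
```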
In this paper, we propose a novel boosted mixture learning (BML) framework for Gaussian mixture HMMs in speech recognition. BML is an incremental method for learning mixture models for classification problems. In each step of BML, one new mixture component is calculated according to the functional gradient of an objective function, ensuring that it is added along the direction that increases the objective function the most. Several techniques are proposed to extend BML from simple mixture models, such as the Gaussian mixture model (GMM), to Gaussian mixture hidden Markov models (HMMs), including a Viterbi approximation to obtain state segmentations, weight decay in the initialization of sample weights to avoid overfitting, combining partial with global parameter updating, and using the Bayesian information criterion (BIC) for parsimonious modeling. Experimental results on the WSJ0 task show that the proposed BML yields relative word and sentence error rate reductions of 10.9% and 12.9%, respectively, over the conventional training procedure.
Linear dynamic models (LDMs) have been shown to be a viable alternative to hidden Markov models (HMMs) on small-vocabulary recognition tasks such as phone classification. In this paper we investigate various statistical model combination approaches for a hybrid HMM-LDM recognizer, resulting in a phone classification performance that outperforms the best individual classifier. Further, we report on continuous speech recognition experiments on the AURORA4 corpus, where the model combination is carried out via word-graph rescoring. While the hybrid system improves on the HMM system in the case of monophone HMMs, the performance of the triphone HMM model could not be improved by monophone LDMs, pointing to the need to introduce context dependency into the LDM model inventory as well.
Speech recognition based on connectionist approaches is one of the most successful alternatives to widespread Gaussian systems. One of the main criticisms of hybrid recognizers is the increased complexity of context-dependent phone modeling, which is a key aspect in medium to large vocabulary tasks. In this paper, we investigate the use of context-dependent triphone models in a connectionist speech recognizer. To this end, the most common triphone state clustering procedures for Gaussian models are compared and applied to our hybrid recognizer. The developed systems with clustered context-dependent triphones show more than 20% relative word error rate reduction compared to a baseline hybrid system on two selected WSJ evaluation test sets. Additionally, recent efforts in porting the proposed context modelling approaches to an LVCSR system for English Broadcast News transcription are reported.
We present a realization of the principle of minimum relative entropy discrimination (MRED) in order to derive a regularized discriminative training method. MRED is advantageous since it provides a Bayesian interpretation of conventional discriminative training methods and regularization techniques. In order to realize MRED for speech recognition, we propose an approximation method for MRED that strictly preserves the constraints used in MRED. Further, in order to perform MRED in practice, an optimization method based on convex optimization and a solver based on the cutting plane algorithm are also proposed. The proposed methods were evaluated on continuous phoneme recognition tasks, where the MRED-based training system outperformed conventional discriminative training methods.
In large vocabulary continuous speech recognition, decision trees are widely used to cluster triphone states. In addition to the commonly used phonetically based questions, others have proposed additional questions such as phone position within the word or syllable. This paper examines using the word or syllable context itself as a feature in the decision tree, providing an elegant way of introducing word- or syllable-specific models into the system. Positive results are reported on two state-of-the-art systems, voicemail transcription and search by voice, across a variety of acoustic model and training set sizes.
This paper describes a novel technique for exploiting duration information in low-resource speech recognition systems. Using explicit duration models significantly increases computational cost due to a large search space. To avoid this problem, most techniques using duration information adopt two-pass, N-best rescoring approaches. In contrast, we propose an algorithm using word duration models with incremental speech rate normalization for one-pass decoding. In the proposed technique, penalties are added only to the scores of words with outlier durations, and not all words need to have duration models. Experimental results show that the proposed technique reduces errors by up to 17% on in-car digit string tasks without a significant increase in computational cost.
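A minimal sketch of the penalty idea, assuming a per-word Gaussian model of speech-rate-normalized duration (the model form, thresholds and names are illustrative assumptions):

```python
def duration_penalty(word, duration, rate, duration_models,
                     alpha=1.0, n_sigma=2.0):
    """Penalty added to a word's score only when its speech-rate-normalized
    duration is an outlier; words without a duration model get no penalty.

    duration_models maps a word to the (mean, std) of its normalized duration.
    """
    if word not in duration_models:
        return 0.0
    mean, std = duration_models[word]
    z = abs(duration / rate - mean) / std
    if z <= n_sigma:                      # typical duration: no penalty
        return 0.0
    return alpha * (z - n_sigma)          # grows with how atypical the duration is
```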
In this paper we introduce a novel hybrid model architecture for speech recognition and investigate its noise robustness on the Aurora 2 database. Our model is composed of a bidirectional Long Short-Term Memory (BLSTM) recurrent neural net that exploits long-range context information for phoneme prediction and a Dynamic Bayesian Network (DBN) for decoding. The DBN is able to learn pronunciation variants as well as typical phoneme confusions of the BLSTM predictor in order to compensate for signal disturbances. Unlike conventional Hidden Markov Model (HMM) systems, the proposed architecture is not based on Gaussian mixture modeling. Even without any feature enhancement, our BLSTM-DBN system outperforms a baseline HMM recognizer by up to 18%.
One-model speech recognition (SR) and speech synthesis (SS) based on a common articulatory movement model are described. The SR engine has an articulatory feature (AF) extractor and an HMM-based classifier that models articulatory gestures. Experimental results on a phoneme recognition task show that AFs outperform MFCCs even if the training data are limited to a single speaker. In the SS engine, the same speaker-invariant HMM is applied to generate an AF sequence; after converting the AFs into vocal tract parameters, the speech signal is synthesized by a PARCOR filter together with a residual signal. Phoneme-to-phoneme speech conversion using AF exchange is also described.
This paper investigates an acoustic modeling approach for low-resourced languages based on bootstrapping and model restructuring. The approach first creates an acoustic model with redundancy by averaging over bootstrapped models trained on resampled subsets of the sparse training data, followed by model restructuring to scale the model down to a desired cardinality. A variety of techniques for Gaussian clustering and model refinement are discussed for the model restructuring. LVCSR experiments are carried out on Pashto with up to 105 hours of training data. The proposed approach is shown to yield more robust acoustic models given sparse training data and to obtain superior performance over the traditional training procedure.
The aim of this work is to improve the performance of lecture speech recognition by using a system combination approach. In this paper, we propose a new combination technique in which various types of acoustic models are combined. In the combination approach, the use of complementary information is important. In order to prepare acoustic models that incorporate a variety of acoustic features, we employ both continuous-mixture hidden Markov models (CMHMMs) and discrete-mixture hidden Markov models (DMHMMs). These models have different patterns of recognition errors. In addition, we propose a new maximum mutual information (MMI) estimation of the DMHMM parameters. In order to evaluate the performance of the proposed method, we conduct recognition experiments on the Corpus of Spontaneous Japanese. In the experiments, a combination of CMHMMs and DMHMMs whose parameters were estimated using the MMI criterion exhibited the best recognition performance.
Recently, the trajectory HMM has been shown to improve the performance of both speech recognition and speech synthesis. However, computing the likelihood of a trajectory HMM efficiently requires a known state sequence, which limits its use to N-best rescoring in speech recognition. Motivated by the success of models with temporally varying parameters, this paper proposes a Temporally Varying Feature Mapping (TVFM) model that transforms the feature vector sequence such that the trajectory information modelled by the trajectory HMM is suppressed. TVFM can therefore be perceived as an implicit trajectory modelling technique. Two approaches for estimating the TVFM parameters are presented. Experimental results for phone recognition on TIMIT and word recognition on the Wall Street Journal corpus show that promising results can be obtained using TVFM.