An unsupervised learning scheme for Hidden Markov Models (HMMs) used in acoustic Language Identification (LID) based on Parallel Phoneme Recognizers (PPR) was recently proposed. It avoids the high cost of orthographically transcribed speech data and phonetic lexica, but was found to introduce a considerable increase in classification errors. Also very recently, discriminative Minimum Language Identification Error (MLIDE) optimization of HMMs for PPR-based LID was introduced, which likewise requires only language-tagged speech data and an initial HMM. The work described here shows how to combine both approaches into an unsupervised, discriminative learning scheme. Experimental results on large telephone speech databases show that with MLIDE the relative increase in error rate introduced by unsupervised learning can be reduced from 61% to 26%; the absolute difference in LID error rate due to the supervised learning step is reduced from 4.1% to 0.8%.
This paper investigates combining contrastive acoustic models for parallel phonotactic language identification systems. PRLM, a typical phonotactic system, uses a phone recogniser to extract phonotactic information from the speech data. Combining multiple PRLM systems forms a Parallel PRLM (PPRLM) system. A standard PPRLM system utilises multiple phone recognisers trained on different languages and phone sets to provide diversification. In this paper, a new approach for PPRLM is proposed in which phone recognisers with different acoustic models are used for the parallel systems. The STC and SPAM precision matrix modelling schemes, as well as the MMI training criterion, are used to produce contrastive acoustic models. Preliminary experimental results are reported on the NIST language recognition evaluation sets. With only two training corpora, a 12-way PPRLM system using different acoustic modelling schemes outperformed the standard 2-way PPRLM system by 2.0-5.0% absolute in EER.
In this paper we describe a novel use of a multi-layer Kohonen self-organizing feature map (MLKSFM) for spoken language identification (LID). A normalized, segment-based input feature vector is used in order to maintain the temporal information of the speech signal. LID is performed using different system configurations of the MLKSFM. Compared with a baseline PPRLM system, our novel system achieves a similar identification rate but requires less training time and no phone labeling of the training data. The MLKSFM with a sheet-shaped map and hexagonal-lattice neighborhood relationship is found to give the best performance for the LID task, achieving LID rates of 76.4% and 62.4% for the 45-sec and 10-sec OGI speech utterances, respectively.
Because of the limitations of single-level classification, existing fusion techniques have difficulty improving language identification performance as the number of languages and features grows. Given that the similarity of feature distributions between different languages may vary, we propose a novel hierarchical language identification framework with multi-level classification. In this approach, target languages are hierarchically clustered into groups according to the distance between them, models are trained both for individual languages and for language groups, and classification is performed hierarchically over multiple levels. The framework is implemented and evaluated in this paper; the results show a relative 15.1% error-rate reduction in the 30 s case on the OGI 10-language database compared to a modern GMM fusion system.
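The abstract leaves the clustering and decision details open; the following is a minimal sketch of the two-level idea, assuming a symmetric distance matrix between the target languages' models (the `dist` values, the distance threshold, and the score dictionaries below are all illustrative, not the paper's).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical symmetric distances between 6 target languages
# (e.g., divergences between their acoustic models).
langs = ["en", "de", "fr", "es", "zh", "ja"]
dist = np.array([
    [0.0, 1.2, 1.5, 1.6, 3.0, 2.9],
    [1.2, 0.0, 1.4, 1.7, 3.1, 3.0],
    [1.5, 1.4, 0.0, 0.9, 2.8, 2.7],
    [1.6, 1.7, 0.9, 0.0, 2.9, 2.8],
    [3.0, 3.1, 2.8, 2.9, 0.0, 1.1],
    [2.9, 3.0, 2.7, 2.8, 1.1, 0.0],
])

# Level 1: cluster languages into groups by their pairwise distance.
groups = fcluster(linkage(squareform(dist), method="average"),
                  t=2.0, criterion="distance")

def identify(group_scores, lang_scores):
    """Two-level decision: pick the best group first, then the best
    language *within* that group only."""
    best_group = max(group_scores, key=group_scores.get)
    members = [l for l, g in zip(langs, groups) if g == best_group]
    return max(members, key=lambda l: lang_scores[l])
```

Restricting the second-level decision to the winning group is what distinguishes this from flat, single-level scoring over all languages at once.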
This paper presents results on using rhythm for automatic language identification (LID). The idea is to exploit the duration of pseudo-syllables as a language-discriminative feature. The resulting Rhythm system is based on bigram duration models of neighbouring pseudo-syllables. The Rhythm system is fused with a Spectral system realized by the parallel Phoneme Recognition (PPR) approach using MFCCs. The LID systems were evaluated on a 7-language identification task using the SpeechDat II databases. Tests were performed with 7-second utterances. Whereas the Spectral system, acting as the baseline, achieved an error rate of 7.9%, the fused system reduced the error rate by 10% relative.
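As a rough illustration of bigram duration modelling (the paper's quantisation, smoothing, and fusion weights are not given, so the bin edges, add-alpha smoothing, and `lam` below are assumptions): pseudo-syllable durations are quantised into discrete classes, a bigram over neighbouring classes scores each language, and the rhythm log-score is fused linearly with the spectral log-score.

```python
import math
from collections import defaultdict

BINS = [0.10, 0.18, 0.30]          # assumed duration bin edges (seconds)

def quantize(duration):
    """Map a pseudo-syllable duration to a discrete class (0..3)."""
    return sum(duration > b for b in BINS)

def train_bigram(duration_seqs, n_classes=4, alpha=1.0):
    """Bigram model over duration classes with add-alpha smoothing."""
    counts = defaultdict(lambda: [alpha] * n_classes)
    for seq in duration_seqs:
        classes = [quantize(d) for d in seq]
        for prev, cur in zip(classes, classes[1:]):
            counts[prev][cur] += 1
    return {prev: [c / sum(row) for c in row]
            for prev, row in counts.items()}

def rhythm_score(model, seq, n_classes=4):
    """Log-probability of a duration sequence under a language's model."""
    classes = [quantize(d) for d in seq]
    uniform = [1.0 / n_classes] * n_classes
    return sum(math.log(model.get(p, uniform)[c])
               for p, c in zip(classes, classes[1:]))

def fused_score(spectral, rhythm, lam=0.8):
    """Linear fusion of spectral and rhythm log-scores (lam assumed)."""
    return lam * spectral + (1.0 - lam) * rhythm
```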
One of the most common approaches in language verification (LV) is phonotactic language verification. Currently, LV performance for different languages under different environments and durations has to be compared experimentally, which can make it difficult to understand LV performance across corpora or durations. LV can be viewed as a special case of hypothesis testing, so that the Neyman-Pearson theorem and other information-theoretic analyses are applicable. In this paper, we introduce a measure of phonotactic confusability based on the phonotactic distribution, making it possible to assess the difficulty of the verification problem analytically. We then propose a method of predicting LV performance. The effectiveness of the proposed approach is demonstrated on the NIST 2003 language recognition evaluation test set.
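The abstract does not spell out the measure, so the sketch below substitutes a standard information-theoretic choice, the symmetrised KL divergence between two languages' phone n-gram distributions, purely to make the idea concrete.

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence between two phonotactic distributions, given as
    dicts mapping phone n-grams to probabilities; eps floors
    unseen n-grams."""
    keys = set(p) | set(q)
    return sum(p.get(k, eps) * math.log(p.get(k, eps) / q.get(k, eps))
               for k in keys)

def confusability(p, q):
    """Symmetrised KL as a stand-in confusability measure: small
    values mean the two languages' phonotactics are hard to tell
    apart, so verification between them should be harder."""
    return 0.5 * (kl(p, q) + kl(q, p))
```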
The bilingual nature of the Maltese Islands gives rise to frequent code switching, both in speech and in writing. In designing a polyglot TTS system capable of handling SMS messages within the local context, it was necessary to devise a pre-processing mechanism for identifying the language of origin of individual word tokens. Given that certain common words can be interlingually ambiguous, and that the domain under consideration is open and prone to word contractions and spelling mistakes, the task is not as straightforward as it may seem at first. In this paper we discuss a language-neutral language identification approach capable of handling the characteristics of the domain in a robust fashion.
One of the most popular and best-performing approaches to language recognition (LR) is Parallel Phonetic Recognition followed by Language Modeling (PPRLM). In this paper we report several improvements to our PPRLM system that allowed us to move from an Equal Error Rate (EER) of over 15% to less than 8% on NIST LR Evaluation 2005 data while still using a standard PPRLM system. The most successful improvement was retraining the phonetic decoders on larger and more appropriate corpora. We have also developed a new system based on Support Vector Machines (SVMs) that uses both Mel Frequency Cepstral Coefficients (MFCCs) and Shifted Delta Cepstra (SDC) as features. This new SVM system alone gives an EER of 10.5% on NIST LRE 2005 data. Fusing our PPRLM system with the new SVM system, we achieve an EER of 5.43% on NIST LRE 2005 data, a relative reduction of almost 66% from our baseline system.
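SDC features are fully specified by four parameters N-d-P-k (7-1-3-7 is the configuration most often used in language recognition work; the abstract does not say which values this system uses). Each SDC frame stacks k delta vectors, each computed over a +/-d frame window, at successive shifts of P frames; a minimal sketch:

```python
import numpy as np

def sdc(cepstra, d=1, P=3, k=7):
    """Shifted Delta Cepstra. `cepstra` is a (T, N) array of
    cepstral frames; returns a (T', k*N) array of SDC frames."""
    T, N = cepstra.shape
    frames = []
    # valid t must satisfy t - d >= 0 and t + (k-1)*P + d <= T - 1
    for t in range(d, T - (k - 1) * P - d):
        deltas = [cepstra[t + i * P + d] - cepstra[t + i * P - d]
                  for i in range(k)]
        frames.append(np.concatenate(deltas))
    return np.array(frames)
```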
This paper introduces a detection methodology for speech recognition technologies for which it is difficult to obtain an abundance of non-target classes. An example is language recognition, where we would like to measure the detection capability for a single target language without confounding it with the modeling capability for non-target languages. The evaluation framework is based on a cross-validation scheme that leaves the non-target class out of the training material allowed for the detector. The framework allows us to use Detection Error Tradeoff curves properly. As another application example, we apply the evaluation scheme to emotion recognition in order to obtain single-emotion detection performance assessments.
In this paper, we adopt the boosting framework to improve the performance of acoustic Gaussian mixture model (GMM) Language Identification (LID) systems. We introduce a set of low-complexity boosted target and anti-models that are estimated from training data to improve class separation, and these models are integrated during the LID backend process, resulting in a fast estimation process. Experiments were performed on the 12-language NIST 2003 language recognition evaluation classification task using a GMM-acoustic-score-only LID system, as well as one that combines GMM acoustic scores with sequence language model scores from GMM tokenization. Classification errors were reduced from 18.8% to 10.5% on the acoustic-score-only system, and from 11.3% to 7.8% on the combined acoustic and tokenization system.
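The combination rule is not given in the abstract; one plausible minimal reading, sketched below under that assumption, accumulates weighted target-minus-anti log-likelihood differences across boosting rounds (the per-round weights `alphas` are hypothetical).

```python
def boosted_score(target_ll, anti_ll, alphas):
    """Combine per-round log-likelihoods from boosted target models
    and anti-models into one detection score for a language.
    target_ll, anti_ll: log-likelihood lists, one entry per boosting
    round; alphas: assumed per-round boosting weights."""
    return sum(a * (t - n) for a, t, n in zip(alphas, target_ll, anti_ll))
```

A positive score then indicates the utterance looks more like the target language than like its anti-model across rounds.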
Gaussian Mixture Models (GMMs) in combination with Support Vector Machine (SVM) classifiers have been shown to give excellent classification accuracy in speaker recognition.
The support vector machine (SVM) framework based on the generalized linear discriminant sequence (GLDS) kernel has been shown to be effective and is widely used in language identification tasks. In this paper, in order to compensate for distortions due to inter-speaker variability within the same language, and to address the practical limits on computer memory imposed by training on large databases, multiple speaker-group-based discriminative classifiers are employed to map the cepstral features of speech utterances into discriminative language characterization score vectors (DLCSV). Backend SVM classifiers are then used to model the probability distribution of each target language in the DLCSV space, and the output scores of the backend classifiers are calibrated into final language recognition scores by a pair-wise posterior probability estimation algorithm. The proposed SVM framework is evaluated on the 2003 NIST Language Recognition Evaluation databases, achieving an equal error rate of 4.0% on 30-second tasks, outperforming the state-of-the-art SVM system by more than a 30% relative error reduction.
We present a novel approach to language identification based on a text categorization technique, namely n-gram frequency ranking. We use a parallel phone recognizer, as in PPRLM, but instead of a language model we create a ranking of the most frequent n-grams, keeping only a fraction of them. We then compute the distance between the input sentence's ranking and each language's ranking, based on the difference in relative positions for each n-gram. The objective of this ranking is to reliably model a longer span than PPRLM, namely 5-grams instead of trigrams, because the ranking needs less training data for reliable estimation. We demonstrate that this approach outperforms PPRLM (6% relative improvement) thanks to the inclusion of 4-grams and 5-grams in the classifier. We present two alternatives: ranking by absolute numbers of occurrences, and ranking by discriminative values (11% relative improvement).
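The rank distance described here matches the classic "out-of-place" measure from text categorization; a minimal sketch (the truncation length and the missing-n-gram penalty are assumptions, and the discriminative-ranking variant is not shown):

```python
from collections import Counter

def ranking(ngrams, top=500):
    """Rank n-grams by frequency, keeping only the `top` most
    frequent ones; returns {ngram: rank}."""
    return {g: r for r, (g, _) in
            enumerate(Counter(ngrams).most_common(top))}

def out_of_place(test_rank, lang_rank, penalty=None):
    """Distance = sum over test n-grams of |rank difference|;
    n-grams absent from the language ranking get a fixed penalty."""
    if penalty is None:
        penalty = len(lang_rank)
    return sum(abs(r - lang_rank.get(g, penalty))
               for g, r in test_rank.items())
```

The identified language is simply the one whose ranking minimizes this distance to the input sentence's ranking.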
In recent evaluations of automatic language recognition systems, phonotactic approaches have proven highly effective [1][2]. However, as most of these systems rely on underlying ASR techniques to derive a phonetic tokenization, they are potentially susceptible to acoustic variability from non-language sources (e.g. gender, speaker, channel). In this paper we apply techniques from ASR research to normalize and adapt HMM-based phonetic models to improve phonotactic language recognition performance. Experiments conducted with these techniques show an EER reduction of 29% over traditional PRLM-based approaches.
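The abstract does not list the specific techniques; as one representative example of front-end normalization commonly borrowed from ASR, here is a cepstral mean and variance normalization (CMVN) sketch, which removes stationary channel and speaker effects before phone decoding:

```python
import numpy as np

def cmvn(features):
    """Cepstral mean and variance normalization over an utterance.
    `features` is a (T, D) array of cepstral frames."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-8   # guard against zero variance
    return (features - mu) / sigma
```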
The random forest language model (RFLM) has shown encouraging results in several automatic speech recognition (ASR) tasks but has been hindered by practical limitations, notably the space complexity of RFLM estimation from large amounts of data. This paper addresses large-scale training and testing of the RFLM via an efficient disk-swapping strategy that exploits the recursive structure of a binary decision tree and the local access property of the tree-growing algorithm, redeeming the full potential of the RFLM and opening avenues for further research, including useful comparisons with n-gram models. The benefits of this strategy are demonstrated by perplexity reduction and lattice rescoring experiments using a state-of-the-art ASR system.
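The key observation is that tree growing works on one path at a time, so inactive subtrees can live on disk. A toy sketch of that idea follows (not the paper's implementation; the pickling scheme and file layout are invented for illustration):

```python
import os
import pickle
import tempfile

class SwappableNode:
    """Decision-tree node whose subtree can be swapped to disk.
    Because tree growing touches one root-to-leaf path at a time
    (the local access property), sibling subtrees off that path can
    be written out and reloaded on demand."""
    _dir = tempfile.mkdtemp(prefix="rflm_")
    _next_id = 0

    def __init__(self, data):
        self.data = data
        self.children = []            # in-memory subtree
        self._path = None             # set while swapped out

    def swap_out(self):
        """Write this node's subtree to disk and free the memory."""
        SwappableNode._next_id += 1
        self._path = os.path.join(self._dir, f"{self._next_id}.pkl")
        with open(self._path, "wb") as f:
            pickle.dump(self.children, f)
        self.children = None

    def swap_in(self):
        """Reload the subtree from disk when the path becomes active."""
        if self._path is not None:
            with open(self._path, "rb") as f:
                self.children = pickle.load(f)
            self._path = None
```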
A topic detection approach based on a probabilistic framework is proposed to realize topic adaptation of speech recognition systems for long speech archives such as meetings. Since topics in such speech are not clearly defined, unlike in news stories, we adopt a probabilistic representation of topics based on probabilistic latent semantic analysis (PLSA). A topical sub-space is constructed by PLSA, speech segments are projected onto the subspace, and each segment is then represented by a vector of the topic probabilities obtained by the projection. Topic detection is performed by clustering these vectors, and topic adaptation is done by collecting relevant texts based on similarity in this probabilistic representation. In experimental evaluations, the proposed approach demonstrated significant reductions in perplexity and out-of-vocabulary rate, as well as robustness against ASR errors.
In this paper, we propose a PLSA-based language model for live sports speech. The model is implemented with the unigram rescaling technique, which combines a topic model and an n-gram. In the conventional method, unigram rescaling is performed with a topic distribution estimated from the history of recognized transcriptions. This method can improve performance; however, it cannot express topic transitions, and incorporating topic transitions can be expected to improve recognition performance further. The proposed method therefore employs a "Topic HMM" instead of a history to estimate the topic distribution. The Topic HMM is a discrete ergodic HMM that expresses typical topic distributions and topic transition probabilities. Word accuracy results indicate an improvement over both a trigram baseline and the conventional PLSA-based method using a recognized history.
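Unigram rescaling itself has a standard form: the background n-gram probability is scaled by how much more likely each word is under the topic model than under the background unigram, then renormalized. A sketch (the exponent `gamma` and explicit renormalization over the vocabulary are assumptions; in the proposed method the topic distribution behind `topic_prob` would come from the Topic HMM state rather than from the recognized history):

```python
def unigram_rescale(ngram_prob, topic_prob, unigram_prob, vocab,
                    history, gamma=1.0):
    """Unigram rescaling: scale each n-gram probability by the ratio
    of the word's topic-model probability to its background unigram
    probability, then renormalize over the vocabulary.
    ngram_prob(w, h), topic_prob(w), unigram_prob(w) are callables."""
    raw = {w: ngram_prob(w, history) *
              (topic_prob(w) / unigram_prob(w)) ** gamma
           for w in vocab}
    z = sum(raw.values())
    return {w: p / z for w, p in raw.items()}
```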
Good language modeling relies on good predefined lexicons. For Chinese, since text has no word boundaries and the concept of a "word" is not well defined, constructing good lexicons is difficult. In this paper, we propose lexicon adaptation with reduced character error (LARCE), which learns new word tokens based on the criterion of reduced error rate on the adaptation corpus. In this approach, a multi-character string is accepted as a new "word" as long as it helps reduce the error rate, so a minimal number of new, high-quality words is obtained. The algorithm is based on character-based consensus networks. In initial experiments on Chinese broadcast news, LARCE is shown not only to significantly outperform PAT-tree-based word extraction algorithms, but even to outperform manually augmented lexicons. We believe the concept is equally useful for other character-based languages.
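At its core this is a greedy accept-if-error-drops loop; the sketch below illustrates only that control flow, with a hypothetical `char_error_rate` oracle standing in for the paper's consensus-network-based scoring.

```python
def larce_select(candidates, base_lexicon, char_error_rate):
    """Greedy LARCE-style selection: accept a multi-character string
    as a new 'word' only if adding it reduces the character error
    rate on the adaptation corpus. `char_error_rate(lexicon)` is a
    hypothetical callable wrapping decode-and-score."""
    lexicon = set(base_lexicon)
    best = char_error_rate(lexicon)
    for cand in candidates:                 # e.g., frequent strings
        trial = char_error_rate(lexicon | {cand})
        if trial < best:
            lexicon.add(cand)
            best = trial
    return lexicon
```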
Discriminative training techniques have been successfully developed for many pattern recognition applications. In speech recognition, discriminative training aims to minimize the word error rate. In an information retrieval system, however, the best performance is achieved by maximizing the average precision. In this paper, we construct a discriminative n-gram language model for information retrieval under the metric of minimum rank error (MRE) rather than the conventional metric of minimum classification error. In the optimization procedure, we maximize the average precision and estimate the language model so as to attain the smallest ranking loss. In experiments on ad-hoc retrieval using TREC collections, the proposed MRE language model performs better than the maximum likelihood and minimum classification error language models.
We investigate the integration of various language model adaptation approaches for a cross-genre adaptation task, to improve Mandarin ASR performance on a recently introduced genre, broadcast conversation (BC). Several adaptation strategies are investigated and their efficacy evaluated in terms of ASR performance, including unsupervised language model adaptation from ASR transcripts and ways to integrate supervised Maximum A Posteriori (MAP) and marginal adaptation within the unsupervised adaptation framework. We found that by effectively combining these approaches, we can achieve as much as a 1.3% absolute gain (6% relative) in the final recognition error rate on the BC genre.
We propose a dynamic language model adaptation method that uses temporal information from lecture slides for lecture speech recognition. The proposed method consists of two steps. First, the language model is adapted with the text extracted from all the slides of a given lecture. Next, the text of a given slide is extracted based on temporal information and used for local adaptation. Hence, the language model used to recognize the speech associated with a given slide changes dynamically from one slide to the next. We evaluated the proposed method on speech data from four Japanese lecture courses. Our experiments show the effectiveness of the proposed method, especially for keyword detection: the F-measure error rate for lecture keywords was reduced by 2.4%.
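A minimal sketch of the slide-synchronized switching plus linear interpolation that such a two-step scheme suggests (the interpolation weight `lam` and the exact combination rule are assumptions; `global_lm` and `local_lm` are hypothetical callables returning word probabilities):

```python
import bisect

def slide_lm(slide_times, slide_models, t):
    """Pick the LM adapted to the slide being shown at time t.
    `slide_times` are sorted slide-change timestamps in seconds;
    assumes slide_times[0] == 0.0 so every t maps to a slide."""
    return slide_models[bisect.bisect_right(slide_times, t) - 1]

def adapted_prob(global_lm, local_lm, word, history, lam=0.7):
    """Linear interpolation of the lecture-level LM and the
    slide-local LM (lam is an assumed interpolation weight)."""
    return lam * global_lm(word, history) + (1 - lam) * local_lm(word, history)
```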
Universities have long relied on written text to share knowledge. As more lectures are made available on-line, these must be accompanied by textual transcripts in order to provide the same access to information as textbooks. While Automatic Speech Recognition (ASR) is a cost-effective method to deliver transcriptions, its accuracy for lectures is not yet satisfactory. One approach for improving lecture ASR is to build smaller, topic-dependent Language Models (LMs) and combine them (through LM interpolation or hypothesis space combination) with general-purpose, large-vocabulary LMs. In this paper, we propose a simple solution for lecture ASR with similar or better Word Error Rate reductions (as well as topic-specific keyword identification accuracies) than combination-based approaches. Our method eliminates the need for two types of LMs by exploiting the lecture slides to collect a web corpus appropriate for modelling both the conversational and the topic-specific styles of lectures.
This paper presents a language model topic adaptation framework for highly inflected languages. In such languages, sub-word units are used as the basic units for language modeling; since such units carry little semantic information, they are not very suitable for topic adaptation. We propose to lemmatize the corpus of training documents before constructing a latent topic model. To adapt the language model, we use a few lemmatized training sentences to find a set of documents that are semantically close to the current document. Fast marginal adaptation of the sub-word trigram language model is used to adapt the background model. Experiments on a set of Estonian test texts show that the proposed approach yields a 19% decrease in language model perplexity. A statistically significant decrease in perplexity is observed even when using just two sentences for adaptation. We also show that the model employing lemmatization gives consistently better results than the unlemmatized model.
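A sketch of the retrieval step, finding training documents semantically close to the (lemmatized) adaptation sentences via topic-vector similarity; all names are illustrative rather than the paper's API, and the cosine-similarity choice is an assumption.

```python
import numpy as np

def closest_documents(adapt_lemmas, doc_topic, lemma_topic, top_n=10):
    """Find training documents semantically closest to the current
    adaptation sentences. `doc_topic` is a (D, K) matrix of
    per-document topic weights from the latent topic model;
    `lemma_topic` maps a lemma to its (K,) topic vector. Assumes at
    least one lemma appears in the topic model."""
    query = np.sum([lemma_topic[l] for l in adapt_lemmas
                    if l in lemma_topic], axis=0)
    query /= np.linalg.norm(query) + 1e-12
    norms = np.linalg.norm(doc_topic, axis=1) + 1e-12
    sims = doc_topic @ query / norms          # cosine similarity
    return np.argsort(-sims)[:top_n]
```

The retrieved documents would then supply the in-domain unigram statistics that fast marginal adaptation uses to reweight the background sub-word trigram model.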
We present an effort to perform topic mixture-based language model adaptation using latent Dirichlet allocation (LDA). We use probabilistic latent semantic analysis (PLSA) to automatically cluster a heterogeneous training corpus, and train an LDA model using the resultant topic-document assignments. Using this LDA model, we then construct topic-specific corpora at the utterance level for interpolation with a background language model during language model adaptation. We also present a novel iterative algorithm for LDA topic inference. Very encouraging results were obtained in preliminary experiments with broadcast news in Mandarin Chinese.
We propose a language modeling and adaptation framework using the Bayesian structural maximum a posteriori (SMAP) principle, in which each n-gram event is embedded in a branch of a tree structure. The nodes in the first layer of this tree represent the unigrams, those in the second layer represent the bigrams, and so on. Each node has an associated hyper-parameter representing information about the prior distribution, and a count representing the number of times the word sequence occurs in the domain-specific data. In general, the hyper-parameters depend on the observation frequency not only of the node's own event but also of the lower-order n-gram event at its parent node. Our automatic speech recognition experiments using the Wall Street Journal corpus verify that the proposed SMAP language model adaptation achieves a 5.6% relative improvement over maximum likelihood language models obtained with the same training and adaptation data sets.
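The parent-dependent prior suggests a recursive MAP estimate in which each node's adaptation counts are smoothed toward its parent's (lower-order) estimate. The sketch below is one such reading, not the paper's exact formulation; the prior weight `tau` and the uniform root distribution are assumptions.

```python
def smap_prob(word, history, counts, tau=10.0, vocab_size=10000):
    """Recursive structural-MAP-style estimate: counts at a node are
    smoothed toward the estimate at its parent (the lower-order
    n-gram), which is itself smoothed toward *its* parent, down to
    a uniform distribution at the root. `counts` maps word-sequence
    tuples to adaptation-data counts; `history` is a tuple."""
    if not history:                      # root: back off to uniform
        c = counts.get((word,), 0)
        n = sum(v for k, v in counts.items() if len(k) == 1)
        return (c + tau / vocab_size) / (n + tau)
    c = counts.get(history + (word,), 0)
    n = sum(v for k, v in counts.items()
            if len(k) == len(history) + 1 and k[:-1] == history)
    parent = smap_prob(word, history[1:], counts, tau, vocab_size)
    return (c + tau * parent) / (n + tau)
```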