Fundamental research on electrical stimulation of the auditory pathways resulted in the Multiple Channel Cochlear Implant, a device which provides understanding of speech to severely-to-profoundly deaf people. The device, a miniaturized receiver-stimulator with multiple electrodes fed with power and speech data through two separate aerials, was first implanted in a patient in 1978 as a prototype and has been commercially produced by Cochlear Limited, Australia, since 1982. Speech processing is based on the discovery that the sensation at each electrode is "vowel-like". Initially, the second formant was coded as a place of stimulation, the sound pressure as a current level, and the voicing frequency as a pulse rate. Further research showed progressively better open-set word and sentence scores with the extraction of the first formant in addition to the second formant (the F0/F1/F2 processor), the addition of high fixed filter outputs (MULTIPEAK), and finally 6 to 8 maximal filter outputs at low rates (SPEAK) and high rates (ACE). All the frequencies were coded on a place basis. World trials completed for the US FDA on late-deafened adults in 1985 and on children from two to 17 years of age in 1990 proved that a 22-channel cochlear implant was safe and effective in enabling them to understand speech both with and without lip-reading.
Over the last few years, several groups have been developing models and algorithms for learning to predict the structure of complex data, sequences in particular, that extend well-known linear classification models and algorithms, such as logistic regression, the perceptron algorithm, and support vector machines. These methods combine the advantages of discriminative learning with those of probabilistic generative models like HMMs and probabilistic context-free grammars. I will introduce linear models for structure prediction and their simplest learning algorithms, and exemplify their benefits with applications to text and speech processing, including information extraction, parsing, and language modeling.
Spontaneous conversation is optimized for human-human communication, but differs in some important ways from the types of speech for which human language technology is often developed. This overview describes four fundamental properties of spontaneous speech that present challenges for spoken language applications because they violate assumptions often applied in automatic processing technology.
We propose an unsupervised dynamic language model (LM) adaptation framework using long-distance latent topic mixtures. The framework employs the Latent Dirichlet Allocation model (LDA) which models the latent topics of a document collection in an unsupervised and Bayesian fashion. In the LDA model, each word is modeled as a mixture of latent topics. Varying topics within a context can be modeled by re-sampling the mixture weights of the latent topics from a prior Dirichlet distribution. The model can be trained using the variational Bayes Expectation Maximization algorithm. During decoding, mixture weights of the latent topics are adapted dynamically using the hypotheses of previously decoded utterances. In our work, the LDA model is combined with the trigram language model using linear interpolation. We evaluated the approach on the CCTV episode of the RT04 Mandarin Broadcast News test set. Results show that the proposed approach reduces the perplexity by up to 15.4% relative and the character error rate by 4.9% relative depending on the size and setup of the training set.
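The linear interpolation of an LDA-based word probability with a trigram probability can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy topics, the topic weights, and the interpolation weight `lam` are all hypothetical, and the adaptation of the topic weights from previously decoded hypotheses is replaced here by fixed values.

```python
import math

def lda_unigram(word, topic_weights, topic_word_probs):
    """P_lda(w) = sum_k weight_k * P(w | topic k)."""
    return sum(g * topic_word_probs[k].get(word, 0.0)
               for k, g in enumerate(topic_weights))

def interpolate(p_trigram, p_lda, lam):
    """Linear interpolation: lam * P_trigram + (1 - lam) * P_lda."""
    return lam * p_trigram + (1.0 - lam) * p_lda

def perplexity(probs):
    """Perplexity of a word sequence given its per-word probabilities."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Toy numbers (hypothetical): two latent topics over a three-word vocabulary.
topic_word_probs = [
    {"stock": 0.6, "market": 0.3, "goal": 0.1},  # a "finance"-like topic
    {"stock": 0.1, "market": 0.1, "goal": 0.8},  # a "sports"-like topic
]
# Assumed to have been adapted from earlier utterances; fixed here.
topic_weights = [0.9, 0.1]

p_lda = lda_unigram("stock", topic_weights, topic_word_probs)
p_mix = interpolate(p_trigram=0.2, p_lda=p_lda, lam=0.7)
```

In this toy setting the topic weights favor the finance-like topic, so the interpolated probability of "stock" rises above its trigram estimate; in practice `lam` would be tuned on held-out data.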
The Hidden Vector State (HVS) model extends the basic Hidden Markov Model (HMM) by encoding each state as a vector of stack states but with restricted stack operations. The model uses a right branching stack automaton to assign valid stochastic parses to a word sequence from which the language model probability can be estimated. The model is completely data driven and is able to model classes from the data that reflect the hierarchical structures found in natural language. This paper describes the design and the implementation of the HVS language model [1], focusing on the practical issues of initialisation and training using Baum-Welch re-estimation whilst accommodating a large and dynamic state space. Results of experiments conducted using the ATIS corpus [2] show that the HVS language model reduces test set perplexity compared to standard class based language models.
In this paper, we present a class-based variable memory length Markov model and its learning algorithm. This is an extension of a variable memory length Markov model. Our model is based on a class-based probabilistic suffix tree, whose nodes have an automatically acquired word-class relation. We experimentally compared our new model with a word-based bi-gram model, a word-based tri-gram model, a class-based bi-gram model, and a word-based variable memory length Markov model. The results show that a class-based variable memory length Markov model outperforms the other models in perplexity and model size.
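The idea of a variable memory length model over classes can be illustrated with a longest-suffix lookup. The sketch below is an assumption-laden simplification: it uses a single global word-to-class map (the abstract describes word-class relations attached to tree nodes), stores the suffix tree as a flat dictionary of contexts, and uses hand-picked toy probabilities. All names are hypothetical, and the learning algorithm itself is not shown.

```python
def longest_suffix_context(history, contexts):
    """Return the longest suffix of `history` that is a stored context."""
    for start in range(len(history) + 1):
        suffix = tuple(history[start:])
        if suffix in contexts:
            return suffix
    raise KeyError("contexts must contain the empty context ()")

def class_vmm_prob(word, history, word2class, contexts, emit):
    """P(w | h) ~ P(c(w) | longest stored class suffix of h) * P(w | c(w))."""
    class_history = [word2class[w] for w in history]
    ctx = longest_suffix_context(class_history, contexts)
    c = word2class[word]
    return contexts[ctx].get(c, 0.0) * emit[c].get(word, 0.0)

# Toy model (hypothetical values).
word2class = {"the": "DET", "a": "DET", "cat": "N", "dog": "N", "runs": "V"}
contexts = {
    (): {"DET": 0.4, "N": 0.4, "V": 0.2},
    ("DET",): {"N": 0.9, "DET": 0.05, "V": 0.05},
    ("DET", "N"): {"V": 0.8, "N": 0.1, "DET": 0.1},
}
emit = {
    "DET": {"the": 0.6, "a": 0.4},
    "N": {"cat": 0.5, "dog": 0.5},
    "V": {"runs": 1.0},
}

p_long = class_vmm_prob("runs", ["the", "cat"], word2class, contexts, emit)
p_short = class_vmm_prob("cat", ["runs", "the"], word2class, contexts, emit)
p_empty = class_vmm_prob("the", ["runs"], word2class, contexts, emit)
```

The three calls use contexts of length two, one, and zero respectively, which is the sense in which the memory length varies with the history.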
We present context-sensitive dynamic classes - a novel mechanism for integrating contextual information from spoken dialogue into a class n-gram language model. We exploit the dialogue system's information state to populate dynamic classes, thus percolating contextual constraints to the recognizer's language model in real time. We describe a technique for training a language model incorporating context-sensitive dynamic classes which considerably reduces word error rate under several conditions. Significantly, our technique does not partition the language model based on potentially artificial dialogue state distinctions; rather, it accommodates both strong and weak expectations via dynamic manipulation of a single model.
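A rough sketch of how a dynamic class might constrain a class bigram at run time is given below. It assumes a uniform distribution over the current members of a class and a hypothetical `$CITY` class; the training technique described in the abstract is not reproduced, and all names and numbers are illustrative only.

```python
def class_bigram_prob(word, prev, class_bigram, class_members):
    """P(w | prev) = sum over classes c containing w of P(c | prev) * P(w | c).

    Tokens that have no entry in `class_members` are treated as ordinary
    words, i.e. singleton classes.
    """
    total = 0.0
    for c, p_c in class_bigram.get(prev, {}).items():
        members = class_members.get(c)
        if members is None:        # ordinary word token
            total += p_c if c == word else 0.0
        elif word in members:      # dynamic class, uniform within the class (assumption)
            total += p_c / len(members)
    return total

# A single, fixed class bigram model (hypothetical values).
class_bigram = {"to": {"$CITY": 0.7, "the": 0.3}}

# Class membership is populated from the dialogue context.
class_members = {"$CITY": {"boston", "denver"}}
p_before = class_bigram_prob("boston", "to", class_bigram, class_members)

# The dialogue context now makes only "boston" plausible.
class_members["$CITY"] = {"boston"}
p_after = class_bigram_prob("boston", "to", class_bigram, class_members)
```

Only the class membership changes between the two calls; the n-gram model itself stays the same, which mirrors the abstract's point about manipulating a single model rather than partitioning it by dialogue state.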
In this paper, we address the issue of generating language model training data during the initial stages of dialogue system development. The process begins with a large set of sentence templates, automatically adapted from other application domains. We propose two methods to filter the raw data set to achieve a desired probability distribution of the semantic content, both on the sentence level and on the class level. The first method utilizes user simulation technology, which obtains the probability model via an interplay between a probabilistic user model and the dialogue system. The second method synthesizes novel dialogue interactions by modeling after a small set of dialogues produced by the developers during the course of system refinement. We evaluated our methodology by speech recognition performance on a set of 520 unseen utterances from naive users interacting with a restaurant domain dialogue system.
Probabilistic latent semantic analysis (PLSA) is a popular approach to text modeling where the semantics and statistics in documents can be effectively captured. In this paper, a novel Bayesian PLSA framework is presented. We focus on exploiting the incremental learning algorithm for solving the updating problem of new domain articles. This algorithm is developed to improve text modeling by incrementally extracting the up-to-date latent semantic information to match the changing domains at run time. The expectation-maximization (EM) algorithm is applied to resolve the quasi-Bayes (QB) estimate of PLSA parameters. The online PLSA is constructed to accomplish parameter estimation as well as hyperparameter updating. Compared to standard PLSA using maximum likelihood estimation, the proposed QB approach is capable of performing dynamic document indexing and classification. Also, we present the maximum a posteriori PLSA for corrective training. Experiments on evaluating model perplexities and classification accuracies demonstrate the superiority of using Bayesian PLSA.
This paper presents a new discriminative language model based on the whole-sentence maximum entropy (ME) framework. In the proposed discriminative ME (DME) model, we exploit an integrated linguistic and acoustic model, which incorporates features from the n-gram model and the acoustic log likelihoods of target and competing models. Through constrained optimization of the integrated model, we estimate the DME language model for speech recognition. We also illustrate the relation between DME estimation and maximum mutual information (MMI) estimation for language modeling. It is interesting to find that using the sentence-level log likelihood ratios of competing and target sentences as the acoustic features for ME language modeling is equivalent to performing MMI discriminative language modeling. In speech recognition experiments, the DME model achieved a lower word error rate than the conventional ME model.
Today's speech recognition systems are able to recognize arbitrary sentences over a large but finite vocabulary. However, many important speech recognition tasks feature an open, constantly changing vocabulary (e.g. broadcast news transcription, translation of political debates, etc.). Ideally, a system designed for such open vocabulary tasks would be able to recognize arbitrary, even previously unseen words. To some extent this can be achieved by using sub-lexical language models. We demonstrate that, by using a simple flat hybrid model, we can significantly improve a well-optimized state-of-the-art speech recognition system over a wide range of out-of-vocabulary rates.
We present a new language model adaptation framework integrated with an error handling method to improve the accuracy of speech recognition and the performance of spoken language applications. The proposed error corrective language model adaptation approach exploits domain-specific language variations and recognition environment characteristics to provide robustness and adaptability for a spoken language system. We report experiments on spoken dialogue tasks whose empirical results show improved accuracy for both speech recognition and spoken language understanding.
Automatic Speech Recognition systems integrate three main knowledge sources: acoustic models, pronunciation dictionary and language models. In contrast to common practices, where each source is optimized independently, then combined in a finite-state search space, we investigate here a training procedure which attempts to adjust (some of) the parameters after, rather than before, combination. To this end, we adapted a discriminative training procedure originally devised for language models to the more general case of arbitrary finite-state graphs. Preliminary experiments performed on a simple name recognition task demonstrate the potential of this approach and suggest possible improvements.
Large vocabulary continuous speech recognizers for English Broadcast News today achieve word error rates below 10%. An important factor for this success is the availability of large amounts of acoustic and language modeling training data. In this paper the recognition of French Broadcast News and English and Spanish parliament speeches is addressed, tasks for which fewer resources are available. A neural network language model is applied that takes better advantage of the limited amount of training data. This approach performs the estimation of the probabilities in a continuous space, thereby allowing smooth interpolations. Word error rate reductions of up to 0.9% absolute are reported with respect to a carefully tuned backoff language model trained on the same data.
One of the challenges in large vocabulary speech recognition is the availability of large amounts of data for training language models. In most state-of-the-art speech recognition systems, n-gram models with Kneser-Ney smoothing still prevail due to their simplicity and effectiveness. In this paper, we study the performance of a new language model, the random forest language model, in the IBM conversational telephony speech recognition system. We show that although the random forest language models are designed to deal with the data sparseness problem, they also achieve statistically significant improvements over n-gram models when the training data has over 500 million words.
This paper considers discriminative training of language models for large vocabulary continuous speech recognition. The minimum word error (MWE) criterion was explored to make use of the word confusion information as well as the local lexical constraints inherent in the acoustic training corpus, in conjunction with those constraints obtained from the background text corpus, for properly guiding the speech recognizer to separate the correct hypothesis from the competing ones. The underlying characteristics of the MWE-based approach were extensively investigated, and its performance was verified by comparison with the conventional maximum likelihood (ML) approaches as well. The speech recognition experiments were performed on the broadcast news collected in Taiwan.
State-of-the-art speech recognition systems use statistical language modeling, and in particular N-gram models, to represent the language structure. The Arabic language has a rich morphology, which motivates the introduction of morphological constraints in the language model. Class-based N-gram models have shown satisfactory results, especially for language model adaptation and training from reduced datasets. They have also proven quite effective in their use of memory space. In this paper, we investigate a new morphological class-based language model. Morphological rules are used to derive the different words in a class from their stem. As a morphological analyzer, a rule-based stemming method is proposed for the Arabic language. The language model has been evaluated on a database composed of articles from the Lebanese newspaper Al-Nahar for the years 1998 and 1999. In addition, a linear interpolation between the N-gram model and the morphological model is also evaluated. Preliminary experiments detailed in this paper show satisfactory results.
It is shown that the enormous improvement in the size of disk storage space in recent years can be used to build multiple word-domain statistical language models, one for each significant word of a language. Each of these word-domain language models is a precise domain model for the relevant significant word and when combined appropriately they provide a highly specific domain language model for the language following a cache, even a short cache. A Multiple Word-Domain model based on 20,000 individual word language models has been constructed and tested on a Wall Street Journal Corpus. Improvements in perplexity, between 25% and 68%, over a base-line tri-gram model have been obtained in tests.
We work on adaptation schemes for language modeling that are well suited to limited-resource scenarios. In order to take advantage of available out-of-domain corpora, language model adaptation using topic mixtures was investigated. This technique has not given good practical results in the past. In this paper, we have made several modifications to an existing tree-based approach. The tree was obtained from the background corpus by means of partitional clustering. All the nodes were exploited in the adapted model, and non-erroneous in-domain transcriptions were used as the adaptation corpus. The modified technique yielded a 14% perplexity improvement in a bilingual BN task, outperforming several nonhierarchical approaches. A strategy for an early application of the language model made it possible to translate this perplexity improvement into a 4% WER reduction.
The ability to build topic specific language models, rapidly and with minimal human effort, is a critical need for fast deployment and portability of ASR across different domains. The World Wide Web (WWW) promises to be an excellent textual data resource for creating topic specific language models. In this paper we describe an iterative web crawling approach which uses a competitive set of adaptive models comprised of a generic topic independent background language model, a noise model representing spurious text encountered in web based data (Webdata), and a topic specific model to generate query strings using a relative entropy based approach for WWW search engines and to weight the downloaded Webdata appropriately for building topic specific language models. We demonstrate how this system can be used to rapidly build language models for a specific domain given just an initial set of example utterances and how it can address the various issues attached with Webdata. In our experiments we were able to achieve a 20% reduction in perplexity for our target medical domain. The gains in perplexity translated to a 4% improvement in ASR word error rate (absolute) corresponding to a relative gain of 14%.
We present a novel trigger-based language model adaptation method oriented to the transcription of meetings. In meetings, the topic is focused and consistent throughout the whole session, therefore keywords can be correlated over long distances. The trigger-based language model is designed to capture such long-distance dependencies, but it is typically constructed from a large corpus, which is usually too general to derive task-dependent trigger pairs. In the proposed method, we make use of the initial speech recognition results to extract task-dependent trigger pairs and to estimate their statistics. Moreover, we introduce a back-off scheme that also exploits the statistics estimated from a large corpus. The proposed model reduced the test-set perplexity twice as much as the typical trigger-based language model constructed from a large corpus, and achieved a remarkable perplexity reduction of 41% over the baseline when combined with an adapted trigram language model.
In state-of-the-art large vocabulary automatic recognition systems, a large statistical language model is used, typically an N-gram. However, in order to estimate this model, a large database of sentences or texts in the same style as the recognition task is needed. For spontaneous speech, no such database is available, since it would have to consist of accurate, and thus expensive, orthographic transcriptions of spoken audio.
This article investigates the use of Internet news sources to automatically adapt the vocabulary of a French and an English broadcast news transcription system. A specific method is developed to gather training, development and test corpora from selected websites, normalizing them for further use. A vectorial vocabulary adaptation algorithm is described which interpolates word frequencies estimated on adaptation corpora to directly maximize lexical coverage on a development corpus. To test the generality of this approach, experiments were carried out simultaneously in French and in English (UK) on a daily basis for the month of May 2004. In both languages, the OOV rate is reduced by more than half.
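The general idea of choosing a vocabulary from interpolated word frequencies so as to minimize the OOV rate on a development corpus can be sketched as below. This is only an illustration under stated assumptions: the abstract does not detail the vectorial algorithm, so a coarse grid search over two corpora stands in for the actual optimization, and the function names and toy corpora are hypothetical.

```python
from collections import Counter

def rel_freq(tokens):
    """Relative frequency of each word in a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def select_vocab(freqs, weights, size):
    """Top-`size` words under the weighted sum of per-corpus relative frequencies."""
    score = Counter()
    for f, lam in zip(freqs, weights):
        for w, p in f.items():
            score[w] += lam * p
    # Sort by score (descending), then alphabetically for determinism.
    ranked = sorted(score, key=lambda w: (-score[w], w))
    return set(ranked[:size])

def oov_rate(tokens, vocab):
    """Fraction of tokens not covered by the vocabulary."""
    return sum(t not in vocab for t in tokens) / len(tokens)

def best_weights(freqs, dev_tokens, size, steps=10):
    """Grid search over interpolation weights for two corpora (stand-in optimizer)."""
    candidates = [(i / steps, 1 - i / steps) for i in range(steps + 1)]
    return min(candidates,
               key=lambda ws: oov_rate(dev_tokens, select_vocab(freqs, ws, size)))

# Toy corpora (hypothetical): an older background corpus and recent web news.
old = "a a a b b c".split()
new = "d d d e e a".split()
dev = "d e d a".split()
freqs = [rel_freq(old), rel_freq(new)]

weights = best_weights(freqs, dev, size=2)
adapted_oov = oov_rate(dev, select_vocab(freqs, weights, 2))
baseline_oov = oov_rate(dev, select_vocab(freqs, (1.0, 0.0), 2))
```

With these toy corpora, a vocabulary built only from the older corpus misses most development tokens, while weights chosen on the development set recover most of them.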
Traditionally, when building an n-gram model, we decide the span of the model history, collect the relevant statistics and estimate the model. The model can be pruned down to a smaller size by manipulating the statistics or the estimated model. This paper shows how an n-gram model can be built by adding suitable sets of n-grams to a unigram model until desired complexity is reached. Very high order n-grams can be used in the model, since the need for handling the full unpruned model is eliminated by the proposed technique. We compare our growing method to entropy based pruning. In Finnish speech recognition tests, the models trained by the growing method outperform the entropy pruned models of similar size.
This work combines grammars and statistical language models for speech recognition together in the same sentence. The grammars are compiled into bigrams with word indices, which serve to distinguish different syntactic positions of the same word. For both the grammatical and statistical parts there is one common interface for obtaining a language model score for bi- or trigrams. With only a small modification to a recogniser prepared for statistical language models, this new model can be applied without using a parser or a finite-state network in the recogniser. Priority is given to the grammar, therefore the combined model is able to disallow certain word transitions. With this combined language model, one or several grammatical phrases can be embedded into longer sentences.
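One way to picture bigrams with word indices is to tag each grammar word with its position, so that the same word at two syntactic positions becomes two distinct tokens. The sketch below makes strong simplifying assumptions: the "grammar" is just a list of fixed word sequences, grammar transitions are scored uniformly, and the `word#phrase.position` token format, the function names, and the floor value are all hypothetical rather than taken from the paper.

```python
def compile_phrases(phrases):
    """Compile word sequences into bigrams over position-indexed tokens.

    Each token becomes "word#p.i" (phrase p, position i), so the same word
    at different positions yields different bigram states.
    """
    bigrams = set()
    for p_id, words in enumerate(phrases):
        tokens = [f"{w}#{p_id}.{i}" for i, w in enumerate(words)]
        bigrams.update(zip(tokens, tokens[1:]))
    return bigrams

def combined_score(prev, word, grammar_bigrams, stat_bigram, floor=1e-6):
    """One scoring interface for both the grammatical and statistical parts."""
    successors = {b for a, b in grammar_bigrams if a == prev}
    if successors:
        # Inside a grammar phrase the grammar has priority and can
        # disallow a transition outright.
        return 1.0 / len(successors) if word in successors else 0.0
    # Outside the grammar, fall back to the statistical bigram.
    return stat_bigram.get((prev, word), floor)

# Toy grammar phrase embedded in a longer sentence (hypothetical).
grammar = compile_phrases([["from", "boston", "to", "boston"]])
stat = {("boston#0.3", "please"): 0.2}

allowed = combined_score("boston#0.1", "to#0.2", grammar, stat)
blocked = combined_score("boston#0.1", "from#0.0", grammar, stat)
after_phrase = combined_score("boston#0.3", "please", grammar, stat)
```

Because the recogniser only ever asks for a bigram score, a lookup of this kind needs no parser or finite-state network at decoding time, which is the property the abstract emphasizes.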