This paper proposes a new phone-lattice-based method for automatic language recognition from speech data. Using phone lattices avoids some of the approximations usually made by language identification (LID) systems that rely on phonotactic constraints to simplify the training and decoding processes. We demonstrate that using phone lattices in both training and testing significantly improves the accuracy of a phonotactically based LID system. Performance is further enhanced by using a neural network to combine the results of multiple phone recognizers. Using three phone recognizers with context-independent phone models, the system achieves an equal error rate of 2.7% on the NIST Eval03 detection test (30 s segments, primary condition) with an overall decoding process that runs faster than real time (0.5xRT).
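To make the phonotactic scoring that such systems rely on concrete, the following minimal Python sketch trains per-language phone bigram models on 1-best phone strings and picks the best-scoring language; the paper's contribution is to replace these 1-best strings with full phone lattices in both training and decoding. All names, the toy data, and the smoothing constant are illustrative.

from collections import defaultdict
import math

def train_bigram_lm(phone_strings, alpha=0.5):
    # Smoothed phone-bigram model estimated from decoded phone strings.
    counts = defaultdict(lambda: defaultdict(float))
    for phones in phone_strings:
        for prev, cur in zip(['<s>'] + phones, phones + ['</s>']):
            counts[prev][cur] += 1.0
    vocab = {p for phones in phone_strings for p in phones} | {'</s>'}
    lm = {prev: {p: math.log((nxt.get(p, 0.0) + alpha) /
                             (sum(nxt.values()) + alpha * len(vocab)))
                 for p in vocab}
          for prev, nxt in counts.items()}
    return lm, len(vocab)

def phonotactic_score(lm_pack, phones):
    # Log-likelihood of a 1-best phone string under one language's model.
    lm, vsize = lm_pack
    floor = -math.log(vsize)            # fallback for unseen histories
    return sum(lm.get(prev, {}).get(cur, floor)
               for prev, cur in zip(['<s>'] + phones, phones + ['</s>']))

# Usage: pick the language whose phonotactic model scores highest.
train = {'en': [['hh', 'ah', 'l', 'ow']], 'es': [['o', 'l', 'a']]}
models = {lang: train_bigram_lm(strings) for lang, strings in train.items()}
test = ['hh', 'ah', 'l', 'ow']
print(max(models, key=lambda lang: phonotactic_score(models[lang], test)))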
This paper introduces a new metric for the quantitative assessment of the similarity of speakers' accents. The ACCDIST metric is based on the correlation of inter-segment distance tables across speakers or groups. Basing the metric on segment similarity within a speaker ensures that it is sensitive to the speaker's pronunciation system rather than to his or her voice characteristics. The metric is shown to have an error rate of only 11% when classifying speakers into 14 English regional accents of the British Isles, half the error rate of a metric based directly on spectral information. The metric may also be useful for cluster analysis of accent groups.
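A minimal sketch of the ACCDIST idea (the feature shapes and random data are placeholders, not the paper's setup): each speaker is reduced to the table of distances between his or her own segment representations, and speakers are compared by correlating these tables, which factors out absolute voice characteristics.

import numpy as np

def distance_table(segments):
    # Flattened upper triangle of pairwise distances between one speaker's
    # segment representations (rows = segments, columns = features).
    d = np.linalg.norm(segments[:, None, :] - segments[None, :, :], axis=-1)
    return d[np.triu_indices(len(segments), k=1)]

def accdist_similarity(table_a, table_b):
    # Correlation of two speakers' distance tables.
    return np.corrcoef(table_a, table_b)[0, 1]

# Usage sketch: assign a test speaker to the accent group whose mean
# distance table correlates best with the speaker's own table.
rng = np.random.default_rng(0)
groups = {acc: np.mean([distance_table(rng.normal(size=(20, 12)))
                        for _ in range(5)], axis=0)
          for acc in ('north', 'south')}
test = distance_table(rng.normal(size=(20, 12)))
print(max(groups, key=lambda a: accdist_similarity(test, groups[a])))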
In this paper we investigate the utility of three aspects of named entity processing: detection, localization and value extraction. We corroborate this task categorization by providing examples of practical applications for each of these subtasks. We also suggest methods for tackling these subtasks, giving particular attention to working with speech data. We employ Support Vector Machines to solve the detection task and show how localization and value extraction can successfully be dealt with using a combination of grammar-based and statistical methods.
This paper describes the common framework underlying phrase-based translation systems and those driven by finite-state transducers, and summarizes a first comparison between them. In both approaches the translation process is based on pairs of source and target word strings (segments) related by word alignment; their main difference lies in the statistical modeling of the translation context. The experimental study was carried out on an English/Spanish version of the VERBMOBIL corpus. Under the constraint of a monotone composition of translated segments to generate the target sentence, the finite-state-based translation outperforms its phrase-based counterpart.
In this paper we present first experiments towards a tighter coupling between Automatic Speech Recognition (ASR) and Statistical Machine Translation (SMT) to improve the overall performance of our speech translation system. In conventional speech translation systems, the recognizer outputs a single hypothesis, which is then translated by the SMT system. This approach has the limitation of depending heavily on the word error rate of the first-best hypothesis. That dependence can be reduced by generating many alternative hypotheses in the form of a word lattice: the translation system can exploit the information in the lattice and the scores from the recognizer to obtain better performance. In our experiments, switching from single-best hypotheses to word lattices as the interface between ASR and SMT, and introducing weighted acoustic scores in the translation system, increased the overall performance by 16.22%.
Adequate confirmation is indispensable in spoken dialog systems to eliminate misunderstandings caused by speech recognition errors. Spoken language also inherently includes redundant expressions, such as disfluencies and out-of-domain phrases, which do not contribute to task achievement. It is easy to define a set of keywords to be confirmed for conventional database query tasks, but not straightforward for general document retrieval tasks. In this paper, we propose two statistical measures for identifying the portions to be confirmed. A relevance score (RS) represents the degree of match with the document set. A significance score (SS) detects portions that materially affect the retrieval results. With these measures, the system can generate confirmations before and after the retrieval, respectively. The strategy is implemented and evaluated on retrieval from a software support knowledge base of 40K entries. It is shown that the proposed strategy using the two measures is more efficient than one using a conventional confidence measure.
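A hedged sketch of how two such measures might be computed; the formulas below and the toy retrieval function are assumptions for illustration, not the paper's exact definitions.

import math

def relevance_score(words, coll_freq, coll_size):
    # RS-like score: average log collection probability of a portion's
    # words. Out-of-domain or misrecognized words score low, so they can
    # be confirmed before the retrieval is run.
    return sum(math.log((coll_freq.get(w, 0) + 1) / (coll_size + 1))
               for w in words) / len(words)

def significance_score(query, phrase, retrieve, k=10):
    # SS-like score: fraction of the top-k results that change when the
    # portion is dropped from the query; high-scoring portions actually
    # drive the result and are worth confirming after the retrieval.
    base = retrieve(query)[:k]
    reduced = retrieve([w for w in query if w not in set(phrase)])[:k]
    return 1.0 - len(set(base) & set(reduced)) / max(len(base), 1)

# Toy retrieval: rank documents by the number of query words they contain.
docs = {1: {'printer', 'driver', 'install'}, 2: {'network', 'timeout'},
        3: {'printer', 'jam'}}
retrieve = lambda q: sorted(docs, key=lambda d: -len(docs[d] & set(q)))
query = ['printer', 'driver', 'um']
print(relevance_score(['um'], {'printer': 120, 'driver': 80}, 10000))
print(significance_score(query, ['printer'], retrieve, k=2))  # 0.5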
This paper presents an open-vocabulary spoken document retrieval system that includes the newly proposed subphonetic segment (SPS) unit and combines multilayer subword units. There are two principal approaches to spoken document retrieval (SDR): the word-based approach and the subword-based approach. An inevitable problem of the word-based approach is that the vocabulary size is limited. The alternative is to perform retrieval on subword-based transcriptions produced by a subword recognizer. Subword-based SDR has the advantages that the recognizer is less expensive and that open-vocabulary retrieval is possible, because the recognition component is not bound to any vocabulary. Our approach to SDR is based on a subword recognizer that first transforms the spoken documents into subword sequences. An experimental evaluation on Japanese retrieval confirmed that the proposed SPS unit and the combination of multilayer subword units are effective for open-vocabulary spoken document retrieval.
This paper proposes a new, efficient partial matching algorithm, called Island-Driven Partial Matching (IDPM), based on Continuous Dynamic Programming (CDP), to realize flexible retrieval from a speech database by spoken query. IDPM efficiently detects the sections of the speech database that match partial sections of the query speech. It applies CDP to short, fixed-length unit reference patterns extracted from the query speech and finds the best-matching island sections in the speech database; similar sections of arbitrary length are then detected by checking only those island sections. Experiments on conversational speech showed that IDPM enables fast matching between arbitrary sections of the reference pattern and the input speech without degrading the detection of similar sections compared with our former method.
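The CDP core can be sketched as follows (the frame features, unit length, and local distance are illustrative assumptions); IDPM would run this for each fixed-length unit cut from the query and then merge the detected islands into arbitrary-length matches.

import numpy as np

def cdp_spot(unit, stream):
    # For every end frame t of `stream`, return the best length-normalized
    # distance of a warping path that matches the whole `unit` and ends at
    # t. Because D[0] is seeded everywhere, a match may start at any frame.
    n, T = len(unit), len(stream)
    local = np.linalg.norm(unit[:, None, :] - stream[None, :, :], axis=-1)
    D = np.full((n, T), np.inf)
    D[0] = local[0]
    for i in range(1, n):
        for t in range(1, T):
            D[i, t] = local[i, t] + min(D[i - 1, t - 1],  # diagonal
                                        D[i, t - 1],      # stretch stream
                                        D[i - 1, t])      # compress unit
    return D[-1] / n

# Usage: a 12-frame unit hidden at frames 80..91 of a 200-frame stream.
rng = np.random.default_rng(1)
stream = rng.normal(size=(200, 13))
unit = stream[80:92] + 0.01 * rng.normal(size=(12, 13))
scores = cdp_spot(unit, stream)
print(int(scores.argmin()))   # end frame of the best island, near 91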
We present in this paper a new method of language detection that operates on telephone speech with short utterances (3 seconds). The neural-network-based modeling can model any language without an acoustic decomposition of the speech signal. We applied our system to a detection task over the 11 languages of the OGI MLTS corpus using 3-second signals. Our design detects languages with a competitive average rate of 77% in a real-time application running on a 1.7 GHz Pentium 4 platform.
Automatic language identification has become an important issue in recent years for speech recognition systems. In this paper, we present our work on language identification for an air traffic control continuous speech recognizer. The system is able to distinguish between Spanish and English. We present several language identification techniques based on full recognition that improve on the baseline results obtained with the widely used PPRLM technique. Our database contains several task-specific problems that are critical for language identification, such as non-native speakers, extremely spontaneous speech, and Spanish-English mixing within the same sentence. We confirm that PPRLM is quite sensitive to these problems and that a technique based on a Bayesian classifier performs best, in spite of its higher computational cost.
In this paper, we present our recent work on the analysis and modeling of dialect in speech. Dialect and accent significantly influence automatic speech recognition performance, and it is therefore critical to detect and classify non-native speech. In this study, we consider three areas: (i) prosodic structure (normalized f0, syllable rate, and sentence duration), (ii) phoneme acoustic space modeling and sub-word classification, and (iii) word-level modeling using large-vocabulary data. The corpora used in this study are the NATO N-4 corpus (2 accents, 2 dialects of English), TIMIT (7 dialect regions), and the American and British English versions of the WSJ corpus. These corpora were selected because they contained audio material from specific dialects/accents of English (N-4), were phonetically balanced and organized across U.S. dialect regions (TIMIT), or contained significant amounts of read audio material from distinct dialects (WSJ). The results show that significant changes occur at the prosodic, phoneme-space, and word levels across dialects, and that effective dialect classification can be achieved using processing strategies from each domain.
Duration features have been thought to be the most obvious correlates of speech rhythm. Previous studies have shown that they can be used to distinguish among some of the world's languages. This paper investigates to what extent the methods employed in those studies can be applied to the dialects of British English. We have tested whether a set of variables derived from automatically extracted duration measurements constitutes a reliable predictor for automatic dialect identification. Preliminary results show that the automatic procedure, combined with high inter-speaker variability, yields overlapping rather than crisp dialectal categories.
Portable devices such as PDA phones and smart phones are increasingly popular, and many of them already have voice dialing capability. The next step is to offer more powerful personal-assistant features such as speech translation. In this paper, we propose a system that translates spoken commands from Chinese into English, in real time, on small portable devices with limited memory and computational power. We address the various computational and platform issues of speech recognition and translation on portable devices, proposing fixed-point computation, discrete front-end speech features, bi-phone acoustic models, grammar-based speech decoding, and unambiguous inversion transduction grammars for transfer-based translation. As a result, our speech translation system requires only 500 KB of memory and a 200 MHz CPU.
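The fixed-point computation mentioned above can be illustrated with a tiny Q15-style sketch (the actual Q format used on the device is an assumption): values are stored as integers scaled by 2**15, so multiplies need only integer hardware.

Q = 15
def to_q(x):     return int(round(x * (1 << Q)))   # float -> Q15 integer
def from_q(a):   return a / (1 << Q)               # Q15 integer -> float
def q_mul(a, b): return (a * b) >> Q               # rescale after multiply

print(from_q(q_mul(to_q(0.5), to_q(-0.25))))       # -0.125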
We present an efficient and effective method that extends the Boosting family of classifiers to allow weighted classes. Typically, classifiers treat all classes as equally important; for many real-world applications this is not appropriate, since the accuracy on a particular class can be more critical than on others. In this paper we extend the mathematical formulation of Boosting to weight the classes differently during training. We have evaluated this method for call classification in the AT&T spoken language understanding system. Our results indicate significant improvements on the "important" classes without a significant loss in overall performance.
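One plausible way to realize this, sketched below under stated assumptions rather than as the paper's exact formulation, is to fold class importances into the initial AdaBoost example weights, so that errors on critical classes cost more in every round.

import numpy as np

def best_stump(X, y, w):
    # Weighted best single-feature threshold classifier (weak learner).
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def boost(X, y, class_weight, rounds=20):
    # Class importances enter through the initial example weights.
    w = np.array([class_weight[c] for c in y], dtype=float)
    w /= w.sum()
    ensemble = []
    for _ in range(rounds):
        err, j, thr, sign = best_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)        # standard AdaBoost update
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    votes = sum(a * s * np.where(X[:, j] > t, 1, -1)
                for a, j, t, s in ensemble)
    return np.where(votes >= 0, 1, -1)

# Usage on synthetic two-class data; class +1 is the "important" one.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
ens = boost(X, y, class_weight={1: 3.0, -1: 1.0})
print((predict(ens, X) == y).mean())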
This paper presents a baseline spoken document retrieval system for Finnish. Due to its agglutinative structure, Finnish speech cannot be adequately transcribed using standard large-vocabulary continuous speech recognition approaches: defining a sufficient lexicon and training the statistical language models are difficult, because words appear in many inflected and compounded forms. In this work we apply a recently developed unlimited-vocabulary speech recognition system that allows the use of n-gram language models based on morpheme-like subword units discovered in an unsupervised manner. In addition to word-based indexing, we also propose an indexing based on the subword units provided directly by our speech recognizer. In an initial evaluation on Finnish news reading, we obtained a fairly low recognition error rate and average document retrieval precisions close to those obtained from human reference transcripts.
In this paper, we propose a discriminative training method to improve naive Bayes classifiers (NBCs) in the context of natural language call routing. As opposed to traditional maximum likelihood estimation, all conditional probabilities in the naive Bayes classifiers are estimated discriminatively based on the minimum classification error criterion: a smoothed classification error rate on the training set is formulated as an objective function, and the generalized probabilistic descent method is used to minimize it with respect to all conditional probabilities in the NBCs. Two versions of NBC are used in this work. In the first version, all NBCs corresponding to the various destinations use the same word feature set, while in the second a destination-dependent feature set is chosen for each destination. Experimental results on a banking call routing task show that the discriminative training method can achieve up to about 30% error reduction over our best ML-trained system.
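A hedged sketch of MCE/GPD training for a naive Bayes text classifier follows; the softmax parameterization, step sizes, and toy data are assumptions made so the sketch stays self-contained, not the paper's exact recipe.

import numpy as np

def log_softmax(t):
    m = t.max(axis=1, keepdims=True)
    return t - m - np.log(np.exp(t - m).sum(axis=1, keepdims=True))

def mce_train(X, y, n_classes, epochs=200, lr=0.1, gamma=2.0):
    # X: (docs, vocab) word counts; y: destination indices. Conditional
    # probabilities stay normalized via a softmax over free parameters
    # theta, trained by gradient descent on a sigmoid-smoothed error.
    n, V = X.shape
    theta = np.zeros((n_classes, V))
    for _ in range(epochs):
        logp = log_softmax(theta)                 # log P(word | class)
        g = X @ logp.T                            # NB discriminants
        idx = np.arange(n)
        g_true = g[idx, y]
        g_comp = g.copy()
        g_comp[idx, y] = -np.inf
        rival = g_comp.argmax(axis=1)             # best wrong class
        d = g_comp[idx, rival] - g_true           # misclassification measure
        l = 1.0 / (1.0 + np.exp(np.clip(-gamma * d, -50.0, 50.0)))
        coeff = gamma * l * (1.0 - l)             # d(loss)/d(d)
        p = np.exp(logp)
        grad = np.zeros_like(theta)
        for i in range(n):                        # softmax chain rule
            grad[y[i]] -= coeff[i] * (X[i] - X[i].sum() * p[y[i]])
            grad[rival[i]] += coeff[i] * (X[i] - X[i].sum() * p[rival[i]])
        theta -= lr * grad / n
    return theta

# Toy call-routing data: 4-word vocabulary, 2 destinations.
X = np.array([[3, 0, 1, 0], [2, 1, 0, 0], [0, 0, 2, 3], [0, 1, 1, 2]], float)
y = np.array([0, 0, 1, 1])
theta = mce_train(X, y, n_classes=2)
print((X @ log_softmax(theta).T).argmax(axis=1))  # routed destinations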
This paper presents a phone-based approach to spoken document retrieval (SDR), developed in the framework of the emerging MPEG-7 standard. We describe an indexing and retrieval system that uses phonetic information only. The retrieval method is based on the vector space IR model, using phone N-grams as indexing terms. We propose a technique that expands the representation of documents by means of phone confusion probabilities in order to improve retrieval performance. The method is tested on a collection of short German spoken documents, using 10 city names as queries.
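A sketch of phone-N-gram indexing with confusion-based document expansion; the confusion matrix values, the pruning floor, and the toy phone strings are illustrative, not taken from the paper.

import numpy as np
from collections import Counter
from itertools import product

def ngrams(phones, n=3):
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]

def expand(phones, confusion, n=3, floor=0.1):
    # Add confusable N-grams, weighted by the product of per-phone
    # confusion probabilities, so recognition errors can still match.
    vec = Counter()
    for gram in ngrams(phones, n):
        alts = [confusion.get(p, {p: 1.0}) for p in gram]
        for alt in product(*alts):
            w = np.prod([confusion.get(p, {p: 1.0}).get(a, 0.0)
                         for p, a in zip(gram, alt)])
            if w >= floor:
                vec[alt] += w
    return vec

def cosine(a, b):
    keys = set(a) | set(b)
    va = np.array([a.get(k, 0.0) for k in keys])
    vb = np.array([b.get(k, 0.0) for k in keys])
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom else 0.0

# Usage: the document representation is expanded; the query is not.
confusion = {'m': {'m': 0.8, 'n': 0.2}, 'n': {'n': 0.8, 'm': 0.2}}
doc = expand(['m', 'a', 'n', 'h', 'a', 'j', 'm'], confusion)
qry = Counter(ngrams(['m', 'a', 'n'], 3))
print(cosine(doc, qry))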
Named entity recognition is important for sophisticated information services such as question answering and text mining, since answer types and text-mining units largely depend on named entities. Korean named entity recognition is difficult because words in Korean named entities lack distinctive surface features such as the capitalization that marks named entities in English. In addition, since named entities are usually unknown words, the morphological analyzer often makes errors on them. Considering these problems, this paper proposes a hybrid named entity recognition system for question answering, constructed on the people domain of an encyclopedia. The experiments demonstrate the soundness of the proposed hybrid NER system, which combines word-feature determination, syllable-based NER, pattern-rule-based NER, and statistical NER.
This paper presents an overview of an online audio indexing system, which creates a searchable index of the speech content embedded in digitized audio files. The system is based on our recently proposed offline audio segmentation techniques. As the data arrives continuously, the system first finds the boundaries of acoustically homogeneous segments. Next, each of these segments is classified as speech, music, or mixture, where mixtures are defined as regions in which speech and other non-speech sounds are present simultaneously and noticeably. The speech segments are then clustered to provide consistent speaker labels. The speech and mixture segments are converted to text via an ASR system. The resulting words are time-stamped together with other metadata (speaker identity, speech confidence score) in an XML file so that target segments can be rapidly identified and accessed. In this paper, we analyze the performance of each stage of this audio indexing system and also compare it with the performance of the corresponding offline modules.
The automatic transcription of German football commentaries and the analysis thereof are described. Histogram normalisation was used to improve the transcription of the very noisy data. The recognition of player names and ontology words was also investigated, since these are of crucial importance for the information retrieval task for which the transcriptions were used.
This paper describes ongoing work on the topic segmentation and indexing module of an alert system for the selective dissemination of multimedia information. This system underwent a field trial in the past year, which exposed a number of issues that must be addressed to improve its performance. Some of our efforts involved the use of multiple topics, confidence measures, and named entity extraction. This paper discusses these approaches and the corresponding results which, unfortunately, are still affected by the limited amount of topic-annotated training data.
This paper raises questions about the discrete or continuous nature of rhythm classes. Within this framework, our study investigates speech rhythm in the Arabic dialects, which have consistently been described as stress-timed in comparison with languages belonging to other rhythm categories. Preliminary evidence from perceptual experiments revealed that listeners use speech rhythm cues to distinguish Arabic speakers from North Africa from those from the Middle East. In an attempt to elucidate the reasons for this perceptual discrimination, an acoustic investigation based on duration measurements was carried out, using the percentage of vocalic intervals (%V) and the standard deviation of consonantal intervals (ΔC). This experiment reveals that despite their rhythmic differences, all Arabic dialects still cluster with stress-timed languages, exhibiting a distribution different from that of languages belonging to other rhythm categories such as French and Catalan. Moreover, our study suggests that there is no such thing as clear-cut rhythm classes, but rather overlapping categories. As a means of comparison, we also used Pairwise Variability Indices to validate the reliability of our findings.
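These duration measures are computed directly from labeled interval durations; a small sketch (the vocalic/consonantal labels are assumed to come from an aligned transcription, and the toy durations are invented):

import numpy as np

def rhythm_metrics(intervals):
    # intervals: list of (kind, duration_s) with kind 'V' (vocalic)
    # or 'C' (consonantal).
    v = np.array([d for k, d in intervals if k == 'V'])
    c = np.array([d for k, d in intervals if k == 'C'])
    percent_v = 100.0 * v.sum() / (v.sum() + c.sum())   # %V
    delta_c = c.std()                                   # ΔC
    return percent_v, delta_c

def npvi(durations):
    # Normalized Pairwise Variability Index over successive intervals.
    d = np.asarray(durations, dtype=float)
    return 100.0 * np.mean(2.0 * np.abs(d[1:] - d[:-1]) / (d[1:] + d[:-1]))

utt = [('C', 0.08), ('V', 0.12), ('C', 0.15), ('V', 0.06), ('C', 0.22),
       ('V', 0.10)]
print(rhythm_metrics(utt))                # stress-timing: low %V, high ΔC
print(npvi([d for k, d in utt if k == 'V']))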
This paper presents a hybrid approach to audio segmentation, in which metric-based segmentation with long sliding windows is first applied to divide an audio stream into shorter sub-segments, and divide-and-conquer segmentation is then applied to a fixed-length window sliding from the beginning to the end of each sub-segment to sequentially detect the remaining acoustic changes. Experimental results on five one-hour broadcast news shows demonstrate that our approach outperforms existing metric-based and model-selection-based approaches.
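A common distance for the metric-based first pass is a ΔBIC criterion between the two halves of the analysis window; the sketch below is a generic version under assumed window sizes and penalty weight, not necessarily the paper's exact configuration.

import numpy as np

def delta_bic(x, y, lam=1.0):
    # Positive when modeling x and y with separate full-covariance
    # Gaussians beats one Gaussian over their concatenation, i.e. when
    # the boundary between x and y is a likely acoustic change point.
    z = np.vstack([x, y])
    d = z.shape[1]
    def logdet(a):
        return np.linalg.slogdet(np.cov(a, rowvar=False)
                                 + 1e-6 * np.eye(d))[1]
    n1, n2, n = len(x), len(y), len(z)
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(z) - n1 * logdet(x) - n2 * logdet(y)) - penalty

# Usage: a genuine change (different means) vs. an arbitrary split.
rng = np.random.default_rng(2)
a = rng.normal(0, 1, size=(150, 12))
b = rng.normal(2, 1, size=(150, 12))
print(delta_bic(a, b) > 0, delta_bic(a[:75], a[75:]) > 0)   # True False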
Information retrieval, which aims to provide people with easy access to all kinds of information, is receiving more and more emphasis. However, most approaches to information retrieval are based primarily on literal term matching and operate in a deterministic manner; their performance is therefore often limited by vocabulary mismatch and cannot be steadily improved through use. To overcome these drawbacks and enhance retrieval performance, in this paper we explore the use of a topical mixture model for statistical Chinese spoken document retrieval. Various model structures and learning approaches were extensively investigated. In addition, the retrieval capabilities were verified by comparison with the conventional vector space model and latent semantic indexing model, as well as our previously presented HMM/N-gram retrieval model. The experiments were performed on the TDT-2 Chinese collection, and very encouraging retrieval performance was obtained.
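Scoring with a topical mixture model can be sketched as follows: each document is a mixture of shared latent topics, and a query is scored by its likelihood under that mixture. The topic parameters below are random placeholders; in practice they would be estimated from data (e.g. by EM), which this sketch omits.

import numpy as np

def tmm_score(query_ids, topic_word, doc_topic, eps=1e-12):
    # log P(query | doc) = sum over query words w of
    # log sum over topics k of P(w | T_k) P(T_k | doc).
    probs = topic_word[:, query_ids].T @ doc_topic
    return float(np.log(probs + eps).sum())

# Usage: rank 50 documents against a 3-word query by query likelihood.
rng = np.random.default_rng(3)
K, V = 4, 1000
topic_word = rng.dirichlet(np.ones(V), size=K)    # P(w | T_k)
docs = rng.dirichlet(np.ones(K), size=50)         # P(T_k | doc) per doc
query = np.array([3, 17, 256])                    # query word ids
ranking = np.argsort([-tmm_score(query, topic_word, d) for d in docs])
print(ranking[:5])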
This paper presents speech-driven Web retrieval models that accept spoken search topics (queries) in the NTCIR-3 Web retrieval task. We experimentally evaluate techniques for combining the outputs of multiple LVCSR models, using a language model (LM) with a 60,000-word vocabulary for recognizing spoken queries. As the model combination technique, we use SVM learning. We show that combining multiple LVCSR models improves both speech recognition and retrieval accuracy in speech-driven text retrieval. Comparing the retrieval accuracies obtained when LMs with 20,000- and 60,000-word vocabularies are used in the LVCSRs, the LM with the larger vocabulary also improves retrieval accuracy.