In this article, we present an approach for the construction of a stochastic dialog manager, in which the system answer is selected by means of a classification procedure. In particular, we use neural networks for the implementation of this classification process, which takes into account the data supplied by the user and the last system turn. The stochastic model is automatically learnt from training data which are labeled in terms of dialog acts. An important characteristic of this approach is the introduction of a partition in the space of sequences of dialog acts in order to deal with the scarcity of available training data. This system has been developed in the DIHANA project, whose goal is the design and development of a dialog system to access a railway information system using spontaneous speech in Spanish. An evaluation of this approach is also presented.
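As a rough illustration of the classification step described above (and not the DIHANA implementation), the sketch below trains a small neural network to map the last system dialog act plus the user's dialog act to the next system act. The act labels, feature encoding, and toy data are all hypothetical.

```python
# Minimal sketch: predict the next system dialog act from the last system act
# and the user's dialog act.  Labels and the toy data below are invented.
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import OneHotEncoder

# Each training example: (last system act, user act) -> next system act.
X_raw = [("ask_origin", "give_origin"),
         ("ask_destination", "give_destination"),
         ("confirm_query", "accept")]
y = ["ask_destination", "confirm_query", "give_timetable"]

enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(X_raw)

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X, y)

# At run time the dialog manager selects the act with the highest posterior.
print(clf.predict(enc.transform([("ask_origin", "give_origin")])))
```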
We study dependencies between discourse structure and speech recognition problems (SRP) in a corpus of speech-based computer tutoring dialogues. This analysis can tell us whether there are places in the discourse structure that are prone to more SRP. We automatically extract the discourse structure by taking advantage of how the tutoring information is encoded in our system. To quantify the discourse structure, we extract two features for each system turn: the depth of the turn in the discourse structure and the type of transition from the previous turn to the current turn. The χ2 test is used to find significant dependencies. We find several interesting interactions which suggest that the discourse structure can play an important role in several dialogue-related tasks: automatic detection of SRP and analyzing spoken dialogue systems with a large state space from limited amounts of available data.
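The dependency test described in this abstract is a standard χ2 test over a contingency table. A minimal sketch, with invented counts, of testing whether the transition type into a system turn depends on whether that turn has an SRP:

```python
# Chi-square dependency test between a discourse-structure feature and SRP
# occurrence.  The contingency counts below are illustrative only.
from scipy.stats import chi2_contingency

# Rows: transition type (e.g. push, same level, pop); columns: SRP / no SRP.
table = [[120, 480],
         [ 90, 610],
         [ 60, 240]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.4f}")
if p < 0.05:
    print("transition type and SRP are significantly dependent")
```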
Our goal is to automatically detect boundaries between discussions of different topics in meetings. Towards this end we adapt the TextTiling algorithm [1] to the context of meetings. Our features include not only the overlapped words between adjacent windows, but also overlaps in the amount of speech contributed by each meeting participant. We evaluate our algorithm by comparing the automatically detected boundaries with the true ones, and computing precision, recall and f-measure. We report average precision of 0.85 and recall of 0.59 when segmenting unseen test meetings. Error analysis of our results shows that although the basic idea of our algorithm is sound, it breaks down when participants stray from typical behavior (such as when they monopolize the conversation for too long).
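To make the TextTiling adaptation concrete, here is a minimal sketch of a boundary score between two adjacent windows that mixes lexical overlap with per-participant speech amounts. The exact features, weighting, and smoothing used by the authors may differ.

```python
# TextTiling-style boundary score between adjacent windows of a meeting:
# low cosine similarity between windows suggests a topic boundary.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def window_vector(turns):
    """turns: list of (speaker, word_list); mixes word counts and speech amounts."""
    vec = Counter()
    for speaker, words in turns:
        vec.update(words)                        # lexical overlap feature
        vec[("SPEAKER", speaker)] += len(words)  # amount of speech per participant
    return vec

def boundary_score(left_turns, right_turns):
    return 1.0 - cosine(window_vector(left_turns), window_vector(right_turns))
```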
In recent years, speech dialog systems have proven valuable for controlling infotainment systems in cars. The increasing popularity of mobile MP3 players and their capacity to store more and more MP3 files suggest the development of a speech control function for car MP3 players. With technical advances in hardware and speech recognition, it is possible to focus on the development of an intuitive speech dialog which ensures that the advantages of speech control are exploited while its disadvantages are kept minimal. This paper describes the user-centered development of such an intuitive dialog, its particular challenges, and the quality achieved in terms of usability and ease of use. The generalizability of the results is also discussed.
This paper describes our work with Let's Go, a telephone-based bus schedule information system that has been in use by the Pittsburgh population since March 2005. Results from several studies show that while task success correlates strongly with speech recognition accuracy, other aspects of dialogue such as turn-taking, the set of error recovery strategies, and the initiative style also significantly impact system performance and user behavior.
Current speech-enabled Intelligent Tutoring Systems do not model student question behavior the way human tutors do, despite evidence indicating the importance of doing so. Our study examined a corpus of spoken tutorial dialogues collected for the development of ITSpoke, an Intelligent Tutoring Spoken Dialogue System. We extracted prosodic, lexical, syntactic, and student- and task-dependent information from student turns. Results of 5-fold cross-validation machine learning experiments using AdaBoosted C4.5 decision trees show prediction of student question-bearing turns at a rate of 79.7%. The most useful features were prosodic, especially the pitch slope of the last 200 milliseconds of the student turn. Student pre-test score was the most-used feature. Findings indicate that using turn-based units is acceptable for incorporating question detection capability into practical Intelligent Tutoring Systems.
In this work, the RWTH automatic speech recognition systems developed for the second TC-STAR evaluation campaign 2006 are presented. The systems were designed to transcribe parliamentary speeches taken from the European Parliament Plenary Sessions (EPPS) in European English and Spanish, as well as speeches from the Spanish Parliament. The RWTH systems apply a two-pass search strategy with a four-gram one-pass decoder, including a fast vocal tract length normalization variant, as the first pass. The systems further include several adaptation and normalization methods, minimum classification error trained models, and Bayes risk minimization. Contrastive results for all relevant individual components are presented on the EPPS Spanish and English data.
In this paper we present an automated approach for non-native speech recognition. We introduce a new phonetic confusion concept that associates sequences of native language (NL) phones with spoken language (SL) phones. Phonetic confusion rules are automatically extracted from a non-native speech database for a given NL and SL using both the NL and SL ASR systems. These rules are used to modify the acoustic models (HMMs) of the SL ASR by adding acoustic models of NL phones according to these rules. As the pronunciation errors that non-native speakers produce depend on the spelling of the words, we have also used graphemic constraints in the phonetic confusion extraction process. In the lexicon, the phones in word pronunciations are linked to the corresponding graphemes (characters) of the word. In this way, the phonetic confusion is established between pairs of (SL phone, grapheme) and sequences of NL phones. We evaluated our approach on French, Italian, Spanish and Greek non-native speech databases. The spoken language is English. The modified ASR system achieved significant relative improvements ranging from 20.3% to 43.2% in sentence error rate and from 26.6% to 50.0% in WER.
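A minimal sketch of the rule-extraction idea, assuming the alignments between (SL phone, grapheme) pairs and recognized NL phone sequences are already produced by the two ASR systems; thresholds and data are illustrative, not the authors' values.

```python
# Sketch of phonetic confusion-rule extraction: count how each
# (SL phone, grapheme) pair is realised as NL phone sequences and keep the
# frequent mappings as rules.
from collections import Counter, defaultdict

def extract_rules(alignments, min_prob=0.1):
    """alignments: iterable of ((sl_phone, grapheme), nl_phone_sequence) pairs."""
    counts = defaultdict(Counter)
    for (sl_phone, grapheme), nl_seq in alignments:
        counts[(sl_phone, grapheme)][tuple(nl_seq)] += 1

    rules = {}
    for key, targets in counts.items():
        total = sum(targets.values())
        rules[key] = {nl: n / total for nl, n in targets.items()
                      if n / total >= min_prob}
    return rules

# Toy example: English /ih/ written "i" often realised as [i] by French speakers.
rules = extract_rules([(("ih", "i"), ["i"]), (("ih", "i"), ["i"]),
                       (("ih", "i"), ["e"])])
print(rules[("ih", "i")])
```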
This paper describes our recent work on the development of a large vocabulary, speaker-independent, continuous speech recognition system for Cantonese-English code-mixing utterances. The details of both acoustic modeling and language modeling are discussed. For acoustic modeling, Cantonese accents in English words are handled by applying cross-lingual acoustic units, as well as modifications to the pronunciation dictionary. Statistical language models are built from a small amount of text data, as there are many limitations on data collection. Language boundary detection based on language identification algorithms is applied, and it offers a slight improvement to the overall accuracy. The recognition accuracy for Chinese characters and English words in the code-mixing utterances is 56.37% and 52.99%, respectively.
The ICSI+ multilingual sentence segmentation system, with results for English and Mandarin broadcast news automatic speech recognizer transcriptions, represents a joint effort involving ICSI, SRI, and UT Dallas. Our approach is based on hidden event language models to exploit lexical information, and maximum entropy and boosting classifiers to exploit lexical as well as prosodic, speaker change and syntactic information. We demonstrate that the proposed methodology, including pitch- and energy-related prosodic features, performs significantly better than a baseline system that uses words and simple pause features only. Furthermore, the obtained improvements are consistent across both languages, and no language-specific adaptation of the methodology is necessary. The best results were achieved by combining hidden event language models with a boosting-based classifier that, to our knowledge, has not previously been applied to this task.
Previously, we proposed two voice-to-phoneme conversion algorithms for speaker-independent voice-tag creation specifically targeted at applications on embedded platforms, an environment sensitive to CPU and memory resource consumption [1]. These two algorithms (batch mode and sequential) were applied in a same-language context, i.e., both acoustic model training and voice-tag creation and application were performed on the same language.
Tone plays an important role in recognizing spoken tonal languages like Chinese. However, the F0 contour discontinuity between voiced and unvoiced segments has traditionally been a bottleneck in modeling tone contours for automatic speech recognition and synthesis, and various heuristic approaches have been proposed to get around the problem. The Multi-Space Distribution (MSD), proposed by Tokuda et al. and applied to HMM-based speech synthesis, models the two probability spaces, discrete for the unvoiced region and continuous for the voiced F0 contour, in a linearly weighted mixture. We extend the MSD to tone modeling for speech recognition applications. Specifically, modeling tones in speaker-independent, spoken Chinese is formulated and tested on a Mandarin speech database. The tone features and spectral features are further separated into two streams, and stream-dependent models are built to cluster the two feature types into separate decision trees. The recognition results show that the tonal syllable error rate is improved from the toneless baseline system to the MSD-based stream-dependent system, from 50.5% to 36.1% and from 46.3% to 35.1%, for the two systems resulting from two different phone sets. The absolute tonal syllable error rate improvement of the new approach is 5.5% and 6.1%, compared with conventional tone modeling.
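The core MSD idea can be illustrated with a toy likelihood function: an unvoiced frame falls into a zero-dimensional space with a discrete weight, while a voiced frame carries an F0 value scored by a weighted continuous density. The single-Gaussian parameterization and values below are illustrative, not taken from the paper.

```python
# Toy multi-space distribution (MSD) likelihood for one F0 observation.
import math

def msd_likelihood(f0, w_uv=0.3, w_v=0.7, mean=200.0, var=400.0):
    """f0 is None for an unvoiced frame, otherwise the F0 value in Hz."""
    if f0 is None:
        return w_uv                      # discrete space: probability mass
    gauss = math.exp(-(f0 - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return w_v * gauss                   # continuous space: weighted density

print(msd_likelihood(None), msd_likelihood(210.0))
```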
This paper presents a comparison of some different acoustic modeling strategies for under-resourced languages. When only limited speech data are available for under-resourced languages, we propose some crosslingual acoustic modeling techniques. We apply and compare these techniques in Vietnamese ASR. Since there is no pronunciation dictionary for some under-resourced languages, we investigate grapheme-based acoustic modeling. Some initialization techniques for context independent modeling and some question generation techniques for context dependent modeling are applied and compared for Khmer ASR.
Multiple accents are often present in spontaneous Chinese Mandarin speech, as most Chinese have learned Mandarin as a second language. We propose a method to handle multiple accents as well as standard speech in a speaker-independent system by merging auxiliary accent decision trees with standard trees and reconstructing the acoustic model. In our proposed method, tree structure and shape are modified according to accent-specific data while the parameter set of the baseline model remains the same. The effectiveness of this approach is evaluated on Cantonese- and Wu-accented, as well as standard, Mandarin speech. Our method yields a significant 4.4% and 3.3% absolute word error rate reduction without sacrificing the performance on standard Mandarin speech.
This paper compares and quantifies the differences between formants of speech across accents. The cross-entropy information measure is used to compare the differences between the formants of the vowels of three major English accents, namely British, American and Australian. An improved formant estimation method, based on linear prediction (LP) model feature analysis and a hidden Markov model (HMM) of formants, is employed for estimation of formant trajectories of vowels and diphthongs. Comparative analysis of the formant space of the three accents indicates that these accents are mostly conveyed by the first two formants. The third and fourth formants exhibit significant differences across accents for only a few phonemes, most notably the vowel variants involving /r/ in the American (rhotic) accent compared to the British (non-rhotic) accent. The issue of speaker variability versus accent variability is examined by comparing the cross-entropies of speech models trained on different groups of speakers within and across the accents.
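As a rough illustration of the cross-entropy comparison, the sketch below assumes each formant of a vowel is summarized by a univariate Gaussian per accent and uses the standard closed-form Gaussian cross-entropy; the paper works with HMM-based formant tracks, so this is not their exact computation, and the values are invented.

```python
# Cross-entropy H(p, q) = -E_p[log q(x)] between two univariate Gaussians,
# here standing in for the same formant modelled in two different accents.
import math

def gaussian_cross_entropy(mu_p, var_p, mu_q, var_q):
    return 0.5 * math.log(2 * math.pi * var_q) + (var_p + (mu_p - mu_q) ** 2) / (2 * var_q)

# Illustrative F2 values (Hz) for the same vowel in two accents.
h = gaussian_cross_entropy(mu_p=1800.0, var_p=150.0**2,
                           mu_q=1950.0, var_q=170.0**2)
print(f"cross-entropy: {h:.3f} nats")
```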
Phonetic differences always exist between any Chinese dialect and standard Chinese (Putonghua). In this paper, a method named automatic dialect-specific Initial/Final (IF) generation is proposed to deal with the issue of phonemic differences; it automatically produces the dialect-specific units based on a model distance measure. A dialect-specific decision tree regrowing method is also proposed to cope with the tri-IF expansion caused by the introduction of dialect-specific IFs (DIFs). In combination with a certain adaptation technique, the proposed methods achieve a syllable error rate (SER) reduction of 18.5% for Shanghai-accented Chinese compared with the Putonghua-based baseline, whereas the use of the DIF set alone leads to an SER reduction of 5.5%.
We propose a novel modeling framework for automatic diacritization of Arabic text. The framework is based on Markov modeling, where each grapheme is modeled as a state emitting a diacritic (or none) from the diacritic space. This space is exactly defined using 13 diacritics and a null diacritic, and covers all the diacritics used in any Arabic text. The state emission probabilities are estimated using maximum entropy (MaxEnt) models. The diacritization process is formulated as a search problem in which the most likely diacritization realization is assigned to a given sentence. We also propose a diacritization parse tree (DPT) for Arabic that allows joint representation of diacritics, graphemes, words, word contexts, morphologically analyzed units, syntactic (parse tree), semantic (parse tree), part-of-speech tags and possibly other information sources. The features used to train the MaxEnt models are obtained from the DPT. In our evaluation we obtained a 7.8% diacritization error rate (DER) and a 17.3% word diacritization error rate (WDER) on dialectal Arabic data using the proposed framework.
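A minimal sketch of the search step: a Viterbi pass that assigns one diacritic per grapheme. The emission scores stand in for the MaxEnt model trained on DPT features, and the first-order dependency on the previous diacritic is an illustrative simplification, not necessarily the paper's exact dependency structure; the toy score table is invented.

```python
# Viterbi search over diacritic assignments, one diacritic per grapheme.
def diacritize(graphemes, diacritics, emit_logp, trans_logp):
    """emit_logp(grapheme, diacritic) and trans_logp(prev, cur) return log scores."""
    best = {d: emit_logp(graphemes[0], d) for d in diacritics}
    back = []
    for g in graphemes[1:]:
        new_best, new_back = {}, {}
        for d in diacritics:
            prev, score = max(((p, s + trans_logp(p, d)) for p, s in best.items()),
                              key=lambda x: x[1])
            new_best[d] = score + emit_logp(g, d)
            new_back[d] = prev
        best, back = new_best, back + [new_back]
    last = max(best, key=best.get)          # trace back the best path
    path = [last]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))

# Toy usage with uniform transitions and a hand-written emission table.
emit = {("b", "a"): -0.1, ("b", "u"): -2.0, ("t", "a"): -1.5, ("t", "u"): -0.2}
print(diacritize(list("bt"), ["a", "u"],
                 lambda g, d: emit[(g, d)], lambda p, c: 0.0))
```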
Grapheme-based mono-, cross- and bilingual speech recognition of Czech and Slovak is presented in this paper. The training and testing procedures follow the MASPER initiative that was formed as part of the COST 278 Action. All experiments were performed using the Czech and Slovak SpeechDat-E databases. Grapheme-based models gave recognition performance equivalent to phoneme-based models in the monolingual as well as the bilingual case. Moreover, bilingual SK-CZ speech recognition is equivalent to monolingual recognition, which indicates the possibility of sharing Czech and Slovak speech data for training bilingual grapheme-based acoustic models usable for recognition of Slovak as well as Czech. The promising results also confirmed the presumption that languages with a close grapheme-to-phoneme relation are well suited for grapheme-based speech recognition.
Automatic labeling of prosodic events in speech has potentially significant implications for spoken language processing applications, and has received much attention over the years, especially after the introduction of annotation standards such as ToBI. Current labeling techniques are based on supervised learning, relying on the availability of a corpus annotated with the prosodic labels of interest in order to train the system. However, creating such resources is an expensive and time-consuming task. In this paper, we examine an unsupervised labeling algorithm for accent (prominence) and prosodic phrase boundary detection at the linguistic syllable level, and evaluate its performance on a standard, manually annotated corpus. We obtain labeling accuracies of 77.8% and 88.5% for the accent and boundary labeling tasks, respectively. These figures compare well against previously reported performance levels for supervised labelers.
In this paper, we describe a set of experiments that examine the correlation between energy and pitch accent. We tested the discriminative power of the energy component of frequency subbands with a variety of frequencies and bandwidths on read speech spoken by four native speakers of Standard American English, using an analysis-by-classification approach. We found that the frequency region most robust to speaker differences is between 2 and 20 Bark. Across all speakers, using only energy features we were able to predict pitch accent in read speech with an accuracy of 81.9%.
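A minimal sketch of the kind of subband-energy feature discussed here: summing spectral energy between two Bark frequencies for one frame. The Bark-to-Hz mapping uses Schroeder's approximation f = 650·sinh(z/7), which may differ from the filterbank the authors used; the frame below is random audio for illustration.

```python
# Energy in a Bark-scale band of one analysis frame.
import numpy as np

def bark_to_hz(z):
    return 650.0 * np.sinh(z / 7.0)      # Schroeder's approximation

def band_energy(frame, sample_rate, low_bark=2.0, high_bark=20.0):
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    lo, hi = bark_to_hz(low_bark), bark_to_hz(high_bark)
    return spectrum[(freqs >= lo) & (freqs <= hi)].sum()

# Example on one 25 ms frame of random audio at 16 kHz.
frame = np.random.randn(400)
print(band_energy(frame, 16000))
```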
We previously conducted emotional speech synthesis using our corpus-based method of generating fundamental frequency (F0) contours from text. The method predicts command values of the F0 contour generation process model instead of directly predicting the F0 value of each time frame. Better control of F0 contours was realized by taking the emotional level of each bunsetsu into account: adding information on which bunsetsu(s) the emotion is especially placed on to the command predictor inputs. In the case of anger, F0 contours closer to the target contours are obtained by adding emotional levels. Speech synthesis was conducted by generating F0 contours in two ways: using commands predicted by taking emotional levels into account and using commands predicted without them. The result of a perceptual experiment indicated that emotion was conveyed well by adding emotional levels.
The present experiment attempts to determine the role of prosody in the identification of stress unit boundaries in Czech, using three types of delexicalized stimuli evaluated by native speakers. After describing the relative contribution of intonation and duration to boundary perception, an analysis of misplaced boundaries is provided. Identified patterns concern especially the relationship between tonal structure and boundary salience, the order of preference between intonation and duration, and the tendency towards perceptual filling of accent lapses.
This paper proposes a novel parametric representation of Mandarin intonation based on orthogonal polynomial approximation. The polynomial is a simplified representation of the Parallel Encoding and Target Approximation (PENTA) intonation model that includes a target component and an approximation component. We also propose predicting the polynomial parameters from linguistic and phonetic attributes with generalized linear models (GLM). The optimal attributes are automatically selected by a stepwise regression method. Thus both model structures and model coefficients are optimized in a totally data-driven manner. In addition, speaking rate is introduced as a new attribute for prediction. When the method is applied to intonation prediction for Mandarin speech, it achieves an F0 RMSE of 30.21 Hz and a correlation coefficient of 0.85 in an open test. Informal perceptual experiments show that the predicted intonation is quite appropriate and natural.
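To illustrate the orthogonal-polynomial representation, the sketch below fits a low-order Legendre polynomial to one unit's F0 contour and keeps the coefficients as the parameters to be predicted. The polynomial order, unit choice, and toy contour are illustrative; the PENTA-style target/approximation decomposition is not reproduced here.

```python
# Fit and reconstruct an F0 contour with Legendre polynomial coefficients.
import numpy as np
from numpy.polynomial import legendre

def f0_to_coeffs(f0_hz, order=3):
    """f0_hz: voiced F0 samples over one unit, e.g. a syllable."""
    x = np.linspace(-1.0, 1.0, len(f0_hz))     # Legendre domain
    return legendre.legfit(x, f0_hz, deg=order)

def coeffs_to_f0(coeffs, n_samples):
    x = np.linspace(-1.0, 1.0, n_samples)
    return legendre.legval(x, coeffs)

contour = 200 + 30 * np.sin(np.linspace(0, np.pi, 20))   # toy F0 contour
coeffs = f0_to_coeffs(contour)
rmse = np.sqrt(np.mean((coeffs_to_f0(coeffs, 20) - contour) ** 2))
print(coeffs, rmse)
```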
Agreement was investigated among five labelers for the use of two prosodic annotation systems: the ToBI (Tones and Break Indices) system [1,2] and the RaP (Rhythm and Pitch) system [3]. Each system permits the labeling of pitch accents and two levels of phrasal boundaries; RaP also permits labeling of speech rhythm and distinguishes multiple levels of prominence on syllables. After training with computerized materials and getting expert feedback, coders applied each system to a corpus of read and spontaneous speech (36 minutes for ToBI and 19 for RaP). Inter-coder reliability was computed according to two metrics: transcriber-syllable-pairs and the kappa statistic. High agreement was obtained for both systems for pitch accent presence, pitch accent type, boundary presence, boundary type, and, for RaP, presence and strength of metrical prominences. Agreement levels for ToBI were similar to those of previous studies [4,5], indicating that participants were proficient coders. Moreover, the high level of agreement demonstrated for the RaP system indicates that RaP is a viable alternative to ToBI for prosodic labeling of large speech corpora.
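The kappa statistic mentioned above is the standard chance-corrected agreement measure; a minimal sketch for two coders labeling the same syllables, with invented labels:

```python
# Cohen's kappa for two coders on the same syllables
# (e.g., pitch accent present vs. absent).
from sklearn.metrics import cohen_kappa_score

coder_a = ["accent", "none", "accent", "none", "accent", "none"]
coder_b = ["accent", "none", "none",   "none", "accent", "none"]
print(cohen_kappa_score(coder_a, coder_b))
```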
In a statistical language model based automated directory assistance system, extracting the salient information from the recognition output can significantly increase the accuracy of the back-end listing database search. In this paper, we describe a hidden Markov model (HMM) based saliency parser that was developed to accurately and efficiently identify salient words from the recognition output by modeling both the syntactic structure and the lexical distribution. The parser can be trained on a relatively small data set with coarse syntactic class labels, without the need for detailed syntactic knowledge or a treebank-like corpus. Experimental results on a research corpus of directory assistance utterances confirm the parser's importance within the automated system. The results demonstrate that the proposed saliency parser can significantly improve the overall automation rate without increasing the error rate.