Automatic pronunciation assessment faces several difficulties. Adequacy in controlling the vocal organs is often estimated from the spectral envelopes of input utterances, but the envelope patterns are also affected by other factors such as speaker identity. Recently, a new method of speech representation was proposed in which these non-linguistic variations are effectively removed by modeling only the contrastive aspects of speech features. This representation is called speech structure. However, the often excessively high dimensionality of the speech structure can degrade the performance of structure-based pronunciation assessment. To deal with this problem, we integrate multilayer regression analysis with the structure-based assessment. The results show a higher correlation between human and machine scores and much higher robustness to speaker differences compared with the widely used GOP-based analysis.
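In the speech-structure literature, the representation is typically an utterance-level matrix of distances between the distributions of speech events, which is what removes speaker-dependent transformations. Below is a minimal sketch, assuming diagonal-covariance Gaussian event models and the Bhattacharyya distance; the names and shapes are illustrative assumptions, not this paper's exact recipe.

```python
import numpy as np

def bhattacharyya_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    avg_var = 0.5 * (var1 + var2)
    term1 = 0.125 * np.sum((mu1 - mu2) ** 2 / avg_var)
    term2 = 0.5 * np.sum(np.log(avg_var / np.sqrt(var1 * var2)))
    return term1 + term2

def structure_matrix(means, variances):
    """Pairwise distance matrix over speech-event distributions.

    means, variances: arrays of shape (n_events, dim).
    The upper triangle of the returned matrix is the 'speech structure'
    feature; with n_events events it has n_events*(n_events-1)/2
    dimensions, which is the dimensionality issue the abstract addresses.
    """
    n = len(means)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = bhattacharyya_diag(
                means[i], variances[i], means[j], variances[j])
    return D
```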
In this paper we show how a pronunciation quality measure can be improved by making use of information on frequent pronunciation errors made by non-native speakers. We propose a new measure, called weighted Goodness of Pronunciation (wGOP), and compare it to the widely used GOP measure. We applied this measure to the task of discriminating correctly realized from incorrectly realized Dutch vowels produced by non-native speakers and observed a substantial increase in performance when sufficient training material is available.
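For reference, the standard GOP score that wGOP builds on is, for a phone p with observations O^(p) spanning NF(p) frames, commonly approximated as the duration-normalized log posterior with a competing-phone maximum in the denominator:

```latex
\mathrm{GOP}(p) \;=\; \frac{1}{NF(p)}
  \left| \log \frac{p\!\left(O^{(p)} \mid p\right)}
                   {\max_{q \in Q} p\!\left(O^{(p)} \mid q\right)} \right|
```

where Q is the phone inventory. The weighting scheme that turns this into wGOP uses the frequent-error information described above and is not reproduced here.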
In this paper, we propose a novel speaker adaptation technique, regularized-MLLR, for Computer Assisted Language Learning (CALL) systems. The method represents each target learner's transformation matrix as a linear combination of a group of teachers' transformation matrices, thus avoiding the over-adaptation problem in which erroneous pronunciations come to be judged as good pronunciations after conventional MLLR speaker adaptation, which uses the learner's imperfect speech as the adaptation target. Experiments on automatic scoring and error detection on public databases show that the proposed method outperforms conventional MLLR adaptation in pronunciation evaluation and avoids the problem of over-adaptation.
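A minimal sketch of the core constraint described above: the learner's transform is restricted to a weighted combination of teachers' MLLR transforms, so the learner's erroneous pronunciations cannot pull the transform toward accepting them. The matrix shapes, the ridge-style weight estimation, and all names are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def combine_teacher_transforms(teacher_Ws, weights):
    """Learner transform as a weighted sum of teachers' MLLR matrices.

    teacher_Ws: list of K matrices, each of shape (d, d + 1) (MLLR [A|b] form).
    weights:    length-K coefficient vector.
    """
    W = np.zeros_like(teacher_Ws[0])
    for w_k, W_k in zip(weights, teacher_Ws):
        W += w_k * W_k
    return W

def estimate_weights(learner_W_ml, teacher_Ws, lam=0.1):
    """Illustrative ridge regression of a (possibly over-fitted) ML estimate
    of the learner's transform onto the span of the teachers' transforms."""
    X = np.stack([W.ravel() for W in teacher_Ws], axis=1)   # (d*(d+1), K)
    y = learner_W_ml.ravel()
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)
```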
In this paper, we propose a method for estimating a score for English pronunciation. Scores estimated by the proposed method were evaluated by correlating them with the teacher's pronunciation scores. The average correlation between the estimated pronunciation scores and the teacher's pronunciation scores over 1, 5, and 10 sentences was 0.807, 0.873, and 0.921, respectively. When the text of the spoken sentence was unknown, we obtained a correlation of 0.878 for 10 utterances. For English phonetic evaluation, we classified English phoneme pairs that are difficult for Japanese speakers to pronounce, using SVM, NN, and HMM classifiers. The correct classification ratios for native English and Japanese English phonemes were 94.6% and 92.3% for SVM, 96.5% and 87.4% for NN, and 85.0% and 69.2% for HMM, respectively. We then investigated the relationship between the classification rate and a native English teacher's pronunciation score, and obtained a high correlation of 0.6 to 0.7.
We propose a novel decision tree based approach to Mandarin tone assessment. In most conventional computer assisted pronunciation training (CAPT) scenarios, a tone production template is prepared as a reference, with only numeric scores as feedback for tone learning. In contrast, decision trees trained on an annotated, tone-balanced corpus make use of a collection of questions related to important cues in categories of tone production. By traversing the paths and nodes associated with a test utterance, a sequence of corrective comments can be generated to guide the learner towards potential improvement. Detailed pronunciation feedback, or a comparison between two paths, can therefore be provided to learners, which is usually unavailable in score-based CAPT systems.
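A toy sketch of how traversing such a tree could yield corrective comments rather than a single numeric score. The questions, thresholds, and comments below are invented for illustration; in the paper the trees are learned from the annotated tone-balanced corpus.

```python
# Hypothetical node structure: each internal node asks a question about a
# tone-production cue; each branch carries an optional corrective comment.
TREE = {
    "question": lambda f: f["f0_range_semitones"] < 4.0,
    "yes": {"comment": "Pitch range is too narrow; exaggerate the contour.",
            "next": None},
    "no": {"comment": None,
           "next": {
               "question": lambda f: f["turning_point_rel"] > 0.7,
               "yes": {"comment": "The fall/rise starts too late in the syllable.",
                       "next": None},
               "no": {"comment": "Tone contour looks acceptable.", "next": None},
           }},
}

def traverse(tree, feats):
    """Walk the tree with a test utterance's tone features, collecting comments."""
    comments = []
    node = tree
    while node is not None:
        branch = node["yes"] if node["question"](feats) else node["no"]
        if branch["comment"]:
            comments.append(branch["comment"])
        node = branch["next"]
    return comments

print(traverse(TREE, {"f0_range_semitones": 6.2, "turning_point_rel": 0.8}))
```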
In this paper, we describe the principle and functionality of the Computer-Assisted Stress Teaching and Learning Environment (CASTLE) that we have proposed and developed to help learners of English as a Second Language (ESL) learn the stress patterns of English. There are three modules in the CASTLE system. The first module, the individualised speech learning material module, provides learners with individualised speech material that possesses their preferred voice features, e.g., gender, pitch and speech rate. The second module, the perception assistance module, is intended to help learners correctly perceive English stress patterns by automatically exaggerating the differences between stressed and unstressed syllables in a teacher's voice. The third module, the production assistance module, is designed to make learners aware of the rhythm of English and to provide them with feedback in order to improve their production of stress patterns.
Automatic evaluation of GOR (Goodness Of pRosody) is an advanced and challenging task in CALL (Computer Aided Language Learning) systems. Beyond traditional prosodic features, we develop a method based on multiple knowledge sources that requires no prior knowledge of the reading text. After speech recognition, in addition to most state-of-the-art features used in prosodic analysis, we derive a more concise and effective feature set from prosody generation based on the Fujisaki model and from the influence of tempo on prosody, i.e., the variability of prosodic components measured with the PVI method. We also propose a boosting training method that requires no annotation, by mining a larger corpus. Experiments on GOR scoring of 1297 speech samples from a group of excellent Chinese students aged 14-16 support several conclusions: on the one hand, adding the knowledge sources on the generation and impact of prosody contributes a 1.76% reduction in EER and a 0.036 increase in correlation over prosodic features alone; on the other hand, the final result can be considerably improved by the boosting training approach and a topic-dependent scheme.
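The PVI-based variability feature mentioned above is typically the normalized Pairwise Variability Index over successive interval durations (e.g., syllable or vocalic intervals). A minimal sketch, assuming a plain list of durations in seconds:

```python
def npvi(durations):
    """Normalized Pairwise Variability Index:
    100 * mean of |d_k - d_{k+1}| / ((d_k + d_{k+1}) / 2)
    over successive interval durations."""
    pairs = zip(durations[:-1], durations[1:])
    diffs = [abs(a - b) / ((a + b) / 2.0) for a, b in pairs]
    return 100.0 * sum(diffs) / len(diffs)

# Example: more even durations give a lower nPVI (less variability).
print(npvi([0.12, 0.13, 0.11, 0.12]))   # roughly 11
print(npvi([0.08, 0.20, 0.06, 0.18]))   # roughly 98
```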
We present a pronunciation error detection method for second language (L2) learners of English. The method combines confidence scoring at the phone level with landmark-based Support Vector Machines (SVMs). The landmark-based SVMs were implemented to focus the method on specific phonemes in which L2 learners make frequent errors. The method was trained on the phonemes that are difficult for Korean learners and tested on intermediate Korean learners. On data in which non-phonemic errors occurred in a high proportion, the SVM method achieved a significantly higher F-score (0.67) than confidence scoring (0.60). However, combining the two methods without appropriate training data did not lead to improvement. Even for intermediate learners, a high proportion of errors (40%) was related to these difficult phonemes. Therefore, a method specialized for these phonemes would benefit both beginners and intermediate learners.
Current text content, especially web content, frequently mixes languages, e.g., Mandarin text mixed with English words. To make the synthesized speech of such mixed-language content sound natural, we need to synthesize it with a single voice. However, this task is very challenging because it is hard to find a voice talent who speaks both languages well enough. The synthesized speech sounds unnatural if an HMM-based TTS system is built directly on the non-native speaker's training corpus. In this paper, we propose using speaker adaptation technology to leverage the native speaker's data to generate more natural speech for the non-native speaker. Evaluation results show that the proposed method significantly improves the speaker consistency and naturalness of synthesized speech for mixed-language text.
This paper provides an in-depth analysis of the impact of language mismatch on the performance of cross-lingual speaker adaptation. Our work confirms the influence of language mismatch between the average voice distributions used for synthesis and for transform estimation, and the necessity of eliminating this mismatch in order to effectively utilize multiple transforms for cross-lingual speaker adaptation. Specifically, we show that language mismatch introduces unwanted language-specific information when estimating multiple transforms, thus making these transforms detrimental to adaptation performance. Our analysis demonstrates that speaker characteristics should be separated from language characteristics in order to improve cross-lingual adaptation performance.
We are developing a real-time lecture transcription system for hearing-impaired students in university classrooms. The automatic speech recognition (ASR) system is adapted to individual lecture courses and lecturers to enhance recognition accuracy. The ASR results are selectively corrected by a human editor, through a dedicated interface, before being presented to the students. An efficient adaptation scheme for the ASR modules has been investigated in this work. The system was tested for a hearing-impaired student in a lecture course on civil engineering. Compared with the current manual note-taking scheme offered by two volunteers, the proposed system generated almost twice the amount of text with a single human editor.
This paper describes an investigation into current browser-based runtimes, including Adobe's Flash and Microsoft's Silverlight, as platforms for delivering web-based speech interfaces. The key difference here is that the browser plugin performs all the computation, without any server-side processing. The first application is an HMM-based text-to-speech engine running in the Adobe Flash plugin. The second application is a WFST-based large vocabulary speech recognition decoder written in C# running inside the Silverlight plugin.
One of the difficulties in language recognition is the variability of the speech signal due to speakers and channels. When the channel mismatch is too large and different categories of channels can be identified, one possibility is to build a specific language recognition system for each category and then fuse them together. This article uses a system selector that takes, for each utterance, the scores of one of the channel-category-dependent systems. The selection is guided by a channel detector. We analyze different ways to design such channel detectors: based on cepstral features or on the Factor Analysis channel variability term. The systems are evaluated in the context of NIST's LRE 2009 and run at 1.65% min-Cavg for a subset of 8 languages and at 3.85% min-Cavg for the 23-language protocol.
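A minimal sketch of the selection step described above, assuming the channel detector outputs a posterior per channel category and that each category has its own pre-computed language scores; the names and data layout are illustrative assumptions.

```python
def select_system_score(channel_posteriors, system_scores):
    """Pick, per utterance, the language-recognition scores of the
    channel-category-dependent system chosen by the channel detector.

    channel_posteriors: dict channel_category -> posterior from the detector.
    system_scores:      dict channel_category -> dict language -> score.
    """
    best_channel = max(channel_posteriors, key=channel_posteriors.get)
    return system_scores[best_channel]

posteriors = {"telephone": 0.8, "broadcast": 0.2}
scores = {"telephone": {"eng": 1.2, "spa": -0.4},
          "broadcast": {"eng": 0.9, "spa": -0.1}}
print(select_system_score(posteriors, scores))   # the telephone system's scores
```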
This paper studies feature selection in phonotactic language recognition. The phonotactic features are represented by n-gram statistics derived from one or more phone recognizers in the form of high-dimensional feature vectors. Two feature selection strategies are proposed to select the n-gram statistics and reduce the dimensionality of the feature vectors, so that higher-order n-gram features can be adopted in language recognition. With the proposed feature selection techniques, we achieved an equal error rate (EER) of 1.84% with 4-gram statistics on the 2007 NIST Language Recognition Evaluation 30s closed test sets.
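As a rough illustration of the feature vectors involved, the sketch below builds a normalized n-gram count vector from a decoded phone sequence over a retained n-gram vocabulary; the normalization and vocabulary handling are assumptions, not the paper's exact recipe.

```python
from collections import Counter
from itertools import islice

def ngram_stats(phones, n, vocab):
    """Normalized n-gram counts over a decoded phone sequence.

    phones: list of phone labels from a phone recognizer.
    vocab:  list of n-grams kept after feature selection; the returned
            vector has one dimension per retained n-gram.
    """
    grams = zip(*(islice(phones, i, None) for i in range(n)))
    counts = Counter(grams)
    total = max(sum(counts.values()), 1)
    return [counts[g] / total for g in vocab]

phones = ["sil", "t", "a", "t", "a", "k", "a", "sil"]
vocab = [("t", "a"), ("a", "t"), ("a", "k"), ("k", "a")]
print(ngram_stats(phones, 2, vocab))   # [2/7, 1/7, 1/7, 1/7]
```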
One successful approach to language recognition is to focus on the most discriminative high-level features of languages, such as phones and words. In this paper, we apply a similar approach to acoustic features using a single GMM tokenizer followed by discriminatively trained language models. A feature selection technique based on the Support Vector Machine (SVM) is used to model higher-order n-grams. Three different ways of building this tokenizer are explored and compared using discriminative uni-gram and generative GMM-UBM models. A discriminative uni-gram using a very large GMM tokenizer with 24,576 components yields an EER of 1.66%, improving to 0.71% when fused with other acoustic approaches, on the NIST03 LRE 30s evaluation.
In common approaches to phonotactic language recognition, decodings are processed and scored in a fully uncoupled way, their time alignment being completely lost. Recently, we presented a new approach to phonotactic language recognition that takes time alignment information into account by considering cross-decoder co-occurrences of phones or phone n-grams at the frame level. In this work, the approach based on cross-decoder co-occurrences of phone n-grams is further developed and evaluated. Systems were built by means of open software and experiments were carried out on the NIST LRE2007 database. A system based on co-occurrences of phone n-grams (up to 4-grams) outperformed the baseline phonotactic system, yielding around 8% relative improvement in terms of EER. The best fused system attained 1.90% EER, which supports the use of cross-decoder dependencies for improved language modeling.
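A sketch of the frame-level co-occurrence counting on which this approach is based, assuming the two decoders' outputs have already been expanded to one phone (or phone n-gram) label per frame; the expansion and the data layout are assumptions for illustration.

```python
from collections import Counter

def cooccurrence_counts(labels_a, labels_b):
    """Count frame-level co-occurrences of units from two phone decoders.

    labels_a, labels_b: per-frame labels for the same utterance from
    decoder A and decoder B; both lists must cover the same frames.
    """
    assert len(labels_a) == len(labels_b)
    return Counter(zip(labels_a, labels_b))

# Frame-aligned labels from two hypothetical decoders (5 frames).
dec_a = ["a", "a", "t", "t", "t"]
dec_b = ["ah", "ah", "d", "t", "t"]
print(cooccurrence_counts(dec_a, dec_b))
# Counter({('a', 'ah'): 2, ('t', 't'): 2, ('t', 'd'): 1})
```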
This paper presents a Variety IDentification (VID) approach and its application to broadcast news transcription for Portuguese. The phonotactic VID system, based on Phone Recognition and Language Modelling, focuses on a single tokenizer that combines distinctive knowledge about differences between the target varieties. This knowledge is introduced into a Multi-Layer Perceptron phone recognizer by training mono-phone models for two varieties as contrasting phone-like classes. Significant improvements in terms of identification rate were achieved compared to conventional single and fused phonotactic and acoustic systems. The VID system is used to select data to automatically train variety-specific acoustic models for broadcast news transcription. The impact of the selection is analyzed and variety-specific recognition is shown to improve results by up to 13% compared to a standard variety baseline.
In this paper, we introduce a new approach to dialect recognition which relies on the hypothesis that certain phones are realized differently across dialects. Given a speaker's utterance, we first obtain the most likely phone sequence using a phone recognizer. We then extract GMM Supervectors for each phone instance. Using these vectors, we design a kernel function that computes the similarities of phones between pairs of utterances. We employ this kernel to train SVM classifiers that estimate posterior probabilities, used during recognition. Testing our approach on four Arabic dialects from 30s cuts, we compare our performance to five approaches: PRLM; GMM-UBM; our own improved version of GMM-UBM which employs fMLLR adaptation; our recent discriminative phonotactic approach; and a state-of-the-art system: SDC-based GMM-UBM discriminatively trained. Our kernel-based technique outperforms all these previous approaches; the overall EER of our system is 4.9%.
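A schematic sketch of a kernel that compares two utterances phone by phone through their GMM supervectors, assuming each utterance has been reduced to a mapping from phone label to a single supervector; the averaging over shared phones and the linear inner product are illustrative choices, not necessarily the authors' exact kernel.

```python
import numpy as np

def phone_kernel(sv_a, sv_b):
    """Similarity of two utterances from per-phone GMM supervectors.

    sv_a, sv_b: dicts mapping a phone label to a (unit-normalized)
    supervector for that phone in the utterance. Phones missing from
    either utterance simply do not contribute.
    """
    shared = set(sv_a) & set(sv_b)
    if not shared:
        return 0.0
    return sum(float(np.dot(sv_a[p], sv_b[p])) for p in shared) / len(shared)

# Usage: precompute the kernel matrix over training utterances and pass it
# to an SVM with a precomputed kernel, e.g.
#   sklearn.svm.SVC(kernel="precomputed", probability=True)
```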
We investigated differences in speech intelligibility between normal and bone-conduction stereo headphones for target speech localized at 45 degrees on the horizontal plane in the presence of competing noise. This was part of our effort to study the possible effect of the crosstalk found in bone-conduction headphones on speech intelligibility. All sound sources were localized on the horizontal plane. The target speech was localized at 45 degrees diagonally relative to the listener, while the noise was localized at various azimuths and distances from the listener. The SNR was set to 0, -6, and -12 dB. We found little difference in intelligibility between headphone types, suggesting that crosstalk in bone-conduction headphones has a negligible effect on intelligibility.
Single-formant, dynamically changing harmonic vowel analogs, namely a target with a single frequency excursion and a longer distracter with a different fundamental frequency and repeated excursions, were generated to assess informational and energetic masking of target transitions in young and elderly listeners. Results indicate the presence of informational masking that is significant only for sub-phonemic formant excursions. Elderly listeners perform similarly to the young, with the exception that they require a target-to-distracter ratio about 10 to 20 dB larger.
Time reversal is often used in experimental studies of language perception and understanding, but little is known about its precise impact on speech sounds. Strikingly, some studies treat reversed speech chunks as speech stimuli lacking lexical information, while others use them as non-speech control conditions. The phonetic perception of reversed speech has not been thoroughly studied so far, and only impressionistic evaluations have been proposed. To fill this gap, we report the results of a phonetic transcription task of time-reversed French pseudo-words performed by 4 expert phoneticians. Results show that, for most phonemes (except unvoiced stops), several phonetic features are preserved by time reversal, leading to rather accurate transcriptions of reversed words. Other phenomena, such as the emergence of epenthetic segments, are also investigated and discussed with insight from the neurocognitive bases of the perception of time-varying sounds.
This study investigates the role of pitch reset in discourse boundary perception. Previous production studies showed that pitch reset is a robust correlate of discourse boundaries: it not only signals boundary location, but also reflects boundary size. In this study, we investigate how listeners perceive and utilize this cue for boundary detection. Results showed that listeners' perception of this cue corresponded to the patterns found in speech production. Moreover, the evidence showed that listeners rely on the amount of reset rather than on the reset pitch height.
This study tested the use of binaural cues by adult dyslexic listeners during speech-in-noise comprehension. Participants listened to words presented in three different noise types (Babble, Fluctuating and Stationary noise) in three different listening configurations: dichotic, monaural and binaural. In controls, we observed substantial informational masking in the monaural configuration, mostly attributable to linguistic interference. This was not observed with binaural noise, suggesting that the interference was suppressed by spatial separation. Dyslexic listeners showed a monaural deficit in Babble noise, but no deficit in binaural processing, suggesting compensation based on the use of spatial cues.
Adults have been shown to categorize an ambiguous syllable differently depending on which sound precedes it. The present paper reports preliminary results from an ongoing experiment investigating seven- to nine-month-olds' sensitivity to non-speech contexts when perceiving an ambiguous syllable. The results suggest that the context effect is already present in infancy. Additional data are currently being collected, and full results will be presented at the conference.
Speech is a rapidly varying signal. Temporal processing generally slows with age, and many older adults experience difficulties in understanding speech. This research involved over 250 young, middle-aged and older listeners. Temporal processing abilities were assessed in numerous vowel-sequence tasks, and analyses examined several factors that might contribute to performance. Significant factors included age and cognitive function as measured by the WAIS-III, but not hearing status for the audible vowels. In addition, learning effects were assessed by retesting two tasks. All groups significantly improved vowel temporal-order identification to a similar degree, but large differences in performance between groups were still observed.