INTERSPEECH.2009 - Others

Total: 358

#1 Feedforward control of a 3D physiological articulatory model for vowel production

Authors: Qiang Fang ; Akikazu Nishikido ; Jianwu Dang ; Aijun Li

A 3D physiological articulatory model has been developed to account for the biomechanical properties of the speech organs in speech production. To control the model for investigating the mechanism of speech production, a feedforward control strategy is necessary to generate proper muscle activations according to desired articulatory targets. In this paper, we elaborate a feedforward control module for the 3D physiological articulatory model. In the feedforward control process, an input articulatory target, specified by articulatory parameters, is first transformed to an intrinsic representation of articulation, and then mapped to a muscle activation pattern by a proposed mapping function. The results show that the proposed feedforward control strategy is able to control the 3D physiological articulatory model with high accuracy, both acoustically and articulatorily.
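
As a concrete illustration of the feedforward idea, the sketch below fits a simple regularized linear map from articulatory parameters to muscle activation levels. It is a hypothetical stand-in for the paper's (unspecified) mapping function, with invented dimensions and random placeholder data:

```python
# Hypothetical sketch: learn a linear map from articulatory parameters
# (e.g., jaw opening, tongue position) to muscle activation patterns,
# as one simple stand-in for the paper's (unspecified) mapping function.
import numpy as np

def fit_activation_map(targets, activations, ridge=1e-3):
    """Ridge regression: articulatory targets (N x P) -> activations (N x M)."""
    X = np.hstack([targets, np.ones((targets.shape[0], 1))])  # add bias term
    W = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ activations)
    return W

def predict_activations(W, target):
    x = np.append(target, 1.0)
    return np.clip(x @ W, 0.0, 1.0)  # assume activations constrained to [0, 1]

# Toy usage with random stand-in data (3 articulatory params, 6 muscles).
rng = np.random.default_rng(0)
T = rng.uniform(size=(100, 3)); A = rng.uniform(size=(100, 6))
W = fit_activation_map(T, A)
print(predict_activations(W, np.array([0.2, 0.5, 0.8])))
```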

#2 Articulatory modeling based on semi-polar coordinates and guided PCA technique

Authors: Jun Cai ; Yves Laprie ; Julie Busset ; Fabrice Hirsch

Research on 2-dimensional static articulatory modeling has been performed using the semi-polar system and guided PCA analysis of lateral X-ray images of the vocal tract. The density of the grid lines in the semi-polar system has been increased to improve descriptive precision. New parameters have been introduced to describe the movements of the tongue apex. An extra feature, the tongue root, has been extracted as one of the elementary factors in order to improve the precision of the tongue model. New methods still remain to be developed for describing the movements of the tongue apex.
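
For readers unfamiliar with the technique, here is a minimal PCA sketch over vocal-tract contour coordinates. This is plain PCA, not the guided variant (which extracts factors such as the jaw by regression first), and the contour data are random stand-ins for digitised X-ray contours:

```python
# Minimal sketch: extract articulatory components by PCA from vocal-tract
# contour coordinates sampled on a (semi-polar) grid.
import numpy as np

rng = np.random.default_rng(1)
contours = rng.normal(size=(200, 2 * 40))  # 200 frames, 40 grid points (x, y)

mean = contours.mean(axis=0)
U, s, Vt = np.linalg.svd(contours - mean, full_matrices=False)

explained = (s**2) / np.sum(s**2)
print("variance explained by first 4 components:", explained[:4].sum())

# Reconstruct contours from their first k component scores.
k = 4
scores = (contours - mean) @ Vt[:k].T
approx = mean + scores @ Vt[:k]
print("mean reconstruction error:", np.abs(approx - contours).mean())
```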

#3 Sequencing of articulatory gestures using cost optimization

Authors: Juraj Simko ; Fred Cummins

Within the framework of articulatory phonology (AP), gestures function as primitives, and their ordering in time is provided by a gestural score. Determining how they should be sequenced in time has been something of a challenge. We modify the task-dynamic implementation of AP by defining tasks to be the desired positions of physically embodied end effectors. This allows us to investigate the optimal sequencing of gestures based on a parametric cost function. Costs evaluated include precision of articulation, articulatory effort, and gesture duration. We find that a simple optimization using these costs results in stable gestural sequences that reproduce several known coarticulatory effects.
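
A toy version of such a cost-based sequencing problem can be written down directly. The sketch below optimizes the onsets and durations of two point-attractor gestures under invented effort, precision, and duration terms; the dynamics, cost weights, and targets are illustrative guesses, not the paper's model:

```python
# Toy sketch: sequence two gestures by minimizing a composite cost
# (articulatory effort + precision error + duration).
import numpy as np
from scipy.optimize import minimize

TARGETS = np.array([1.0, -0.5])  # desired end-effector positions

def simulate(onsets, durs, dt=0.01):
    """Critically damped second-order movement toward each target in turn."""
    t_end = float(np.max(onsets + durs)) + 0.2
    n = int(t_end / dt)
    x, v = 0.0, 0.0
    traj = np.zeros(n)
    for i in range(n):
        t = i * dt
        active = [g for g in range(2) if onsets[g] <= t]
        tgt = TARGETS[active[-1]] if active else 0.0
        k = 80.0                          # stiffness of the point attractor
        a = k * (tgt - x) - 2 * np.sqrt(k) * v
        v += a * dt; x += v * dt
        traj[i] = x
    return traj, dt

def cost(params, w=(1.0, 10.0, 0.5)):
    onsets, durs = params[:2], np.abs(params[2:])
    traj, dt = simulate(onsets, durs)
    effort = np.sum(np.diff(traj) ** 2) / dt          # kinetic-energy proxy
    ends = np.clip(((onsets + durs) / dt).astype(int), 0, len(traj) - 1)
    precision = np.sum((traj[ends] - TARGETS) ** 2)   # miss distance at offsets
    duration = np.max(onsets + durs)
    return w[0] * effort + w[1] * precision + w[2] * duration

res = minimize(cost, x0=[0.0, 0.3, 0.25, 0.25], method="Nelder-Mead")
print("optimized onsets/durations:", res.x)
```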

#4 From experiments to articulatory motion - a three-dimensional talking head model

Authors: Xiao Bo Lu ; William Thorpe ; Kylie Foster ; Peter Hunter

The goal of this study is to develop a customised computer model that can accurately represent the motion of vocal articulators during vowels and consonants. Models of the articulators were constructed as Finite Element (FE) meshes based on digitised high-resolution MRI (Magnetic Resonance Imaging) scans obtained during quiet breathing. Articulatory kinematics during speaking were obtained by EMA (Electromagnetic Articulography) and video of the face. The movement information thus acquired was applied to the FE model to provide jaw motion, modeled as a rigid body, and tongue, cheek and lip movements modeled with a free-form deformation technique. The motion of the epiglottis has also been considered in the model.

#5 Towards robust glottal source modeling

Authors: Javier Pérez ; Antonio Bonafonte

We present a new method for the simultaneous estimation of the derivative glottal waveform and the vocal tract filter. The algorithm is pitch-synchronous and uses overlapping frames of several glottal cycles to increase the robustness and quality of the estimation. Two parametric models of the glottal waveform are used: KLGLOTT88 during the convex optimization iteration, and the LF model for the final parametrization. To evaluate performance, we use a synthetic corpus built from real data published in several studies. A second corpus, consisting of isolated vowels uttered with different voice qualities, has been recorded specially for this work. The algorithm performs well, in terms of glottal waveform matching, with most of the voice qualities present in the synthetic dataset. Performance is also good, in terms of resynthesis quality, on the real vowel dataset.
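
For reference, the KLGLOTT88 open-phase flow is the cubic U(t) = a·t² − b·t³, and the quantity estimated here is its time derivative. The sketch below generates one cycle, with coefficients chosen so the flow peaks at amplitude AV and returns to zero at t = OQ·T0; this is the textbook form of the model, not the authors' code:

```python
# Hedged sketch of the KLGLOTT88 glottal model (the one used in the paper's
# convex optimization stage): flow U(t) = a t^2 - b t^3 over the open phase,
# whose time derivative gives the derivative glottal waveform.
import numpy as np

def klglott88_derivative(f0, oq=0.6, av=1.0, fs=16000):
    """One cycle of the KLGLOTT88 derivative glottal flow."""
    t0 = 1.0 / f0                    # fundamental period
    te = oq * t0                     # open-phase duration
    a = 27.0 * av / (4.0 * te**2)    # chosen so the flow peaks at AV
    b = 27.0 * av / (4.0 * te**3)    # and returns to zero at t = te
    t = np.arange(int(t0 * fs)) / fs
    du = np.where(t < te, 2 * a * t - 3 * b * t**2, 0.0)
    return du

cycle = klglott88_derivative(f0=120.0)
print("closure spike (min of dU/dt):", cycle.min())
```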

#6 Sliding vocal-tract model and its application for vowel production

Author: Takayuki Arai

In a previous study, Arai implemented a sliding vocal-tract model based on Fant's three-tube model and demonstrated its usefulness for education in acoustics and speech science. The sliding vocal-tract model consists of a long outer cylinder and a short inner cylinder, which simulates tongue constriction in the vocal tract. This model can produce different vowels by sliding the inner cylinder and changing the degree of constriction. In this study, we investigated the model's coverage of the vowel space and explored its application to vowel production in the speech and hearing sciences.
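
The acoustic behaviour of such tube models can be sketched with lossless chain matrices. The toy below computes resonances for a three-section (wide-narrow-wide) area function, closed at the glottis and open at the lips; moving the narrow section mimics sliding the inner cylinder. The dimensions are rough guesses, not the physical model's:

```python
# Illustrative sketch: formants of a three-tube area function via lossless
# chain (transfer) matrices; resonances appear where |D| dips, since the
# glottis-to-lips volume-velocity transfer is 1/D with the lips open.
import numpy as np

RHO_C = 1.2 * 343.0  # air density x speed of sound

def formants(sections, fmax=4000.0, df=1.0, c=343.0):
    """sections: list of (length_m, area_m2) ordered from glottis to lips."""
    freqs = np.arange(df, fmax, df)
    D = np.empty_like(freqs)
    for i, f in enumerate(freqs):
        k = 2 * np.pi * f / c
        M = np.eye(2, dtype=complex)
        for L, A in sections:
            Zc = RHO_C / A
            M = M @ np.array([[np.cos(k * L), 1j * Zc * np.sin(k * L)],
                              [1j * np.sin(k * L) / Zc, np.cos(k * L)]])
        D[i] = abs(M[1, 1])
    dips = (D[1:-1] < D[:-2]) & (D[1:-1] < D[2:])   # local minima of |D|
    return freqs[1:-1][dips]

# /a/-like shape: wide back cavity, narrow mid constriction, wide front cavity.
print(formants([(0.09, 3e-4), (0.02, 0.3e-4), (0.06, 3e-4)])[:3])
```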

#7 Minimum hypothesis phone error as a decoding method for speech recognition

Authors: Haihua Xu ; Daniel Povey ; Jie Zhu ; Guanyong Wu

In this paper we show how methods for approximating phone error, as normally used for Minimum Phone Error (MPE) discriminative training, can be used instead as a decoding criterion for lattice rescoring. This is an alternative to Confusion Networks (CN), which are commonly used in speech recognition. The standard (Maximum A Posteriori) decoding approach is a Minimum Bayes Risk estimate with respect to the Sentence Error Rate (SER); however, we are typically more interested in the Word Error Rate (WER). Methods such as CN and our proposed Minimum Hypothesis Phone Error (MHPE) aim to get closer to minimizing the expected WER. Based on preliminary experiments we find that our approach gives more improvement than CN, and is conceptually simpler.
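
The underlying Minimum Bayes Risk idea is easy to state on an N-best list: pick the hypothesis whose posterior-weighted expected edit distance to the other hypotheses is smallest. The toy below illustrates this at the word level; the paper's actual criterion approximates phone error on lattices, which this sketch does not attempt:

```python
# Minimal N-best MBR decoding sketch: minimize expected word edit distance
# under the hypothesis posterior.
import numpy as np

def edit_distance(a, b):
    d = np.arange(len(b) + 1)
    for i, wa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, wb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (wa != wb))
    return d[-1]

def mbr_decode(hyps, logprobs):
    post = np.exp(logprobs - np.max(logprobs))
    post /= post.sum()
    risks = [sum(p * edit_distance(h, h2) for h2, p in zip(hyps, post))
             for h in hyps]
    return hyps[int(np.argmin(risks))], risks

hyps = [["the", "cat", "sat"], ["the", "cats", "sat"], ["a", "cat", "sat"]]
best, risks = mbr_decode(hyps, np.log([0.4, 0.35, 0.25]))
print(best, np.round(risks, 3))
```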

#8 Posterior-based out-of-vocabulary word detection in telephone speech

Authors: Stefan Kombrink ; Lukáš Burget ; Pavel Matějka ; Martin Karafiát ; Hynek Hermansky

In this paper we present an out-of-vocabulary (OOV) word detector suitable for English conversational and read speech. We use an approach based on phone posteriors created by a Large Vocabulary Continuous Speech Recognition (LVCSR) system and an additional phone recognizer, which allows detection of OOV and misrecognized words. In addition, the recognized word output can be transcribed in more detail using several classes. Results are reported on CallHome English and Wall Street Journal data.
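
One hedged sketch of the comparison idea: score each recognized word by how strongly the unconstrained phone recognizer's frame posteriors support the LVCSR-constrained alignment, and flag low-scoring words as OOV or misrecognized. The posteriors, alignment, word spans, and threshold below are all invented stand-ins, not the paper's system:

```python
# Toy posterior-comparison OOV detector.
import numpy as np

def word_scores(frame_post, lvcsr_phones, word_spans):
    """frame_post: (T x P) posteriors from the free phone recognizer;
    lvcsr_phones: per-frame phone ids from the constrained LVCSR alignment;
    word_spans: list of (word, start_frame, end_frame)."""
    frame_llr = np.log(frame_post[np.arange(len(lvcsr_phones)), lvcsr_phones]
                       / frame_post.max(axis=1))
    return [(w, frame_llr[s:e].mean()) for w, s, e in word_spans]

rng = np.random.default_rng(2)
post = rng.dirichlet(np.ones(40), size=100)      # 100 frames, 40 phones
align = rng.integers(0, 40, size=100)            # constrained alignment
align[:50] = post[:50].argmax(axis=1)            # first word: alignment agrees
spans = [("hello", 0, 50), ("zorblat", 50, 100)] # second word stands in for OOV
for word, score in word_scores(post, align, spans):
    print(word, "flagged-OOV" if score < -1.0 else "accepted", round(score, 2))
```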

#9 Automatic transcription system for meetings of the Japanese National Congress

Authors: Yuya Akita ; Masato Mimura ; Tatsuya Kawahara

This paper presents an automatic speech recognition (ASR) system for assisting with the creation of meeting records for the National Congress of Japan. The system is designed to cope with the spontaneous characteristics of meeting speech, as well as a variety of topics and speakers. For the acoustic model, minimum phone error (MPE) training is applied with several normalization techniques. For the language model, we have proposed a statistical style transformation to generate spoken-style N-grams and their statistics. We also introduce statistical modeling of pronunciation variation in spontaneous speech. The ASR system was evaluated on real congressional meetings and achieved a word accuracy of 84%. The results also suggest that ASR-based transcripts at this accuracy level are usable for editing meeting records.

#10 Cross-language bootstrapping for unsupervised acoustic model training: rapid development of a Polish speech recognition system

Authors: Jonas Lööf ; Christian Gollan ; Hermann Ney

This paper describes the rapid development of a Polish language speech recognition system. The system development was performed without access to any transcribed acoustic training data. This was achieved through the combined use of cross-language bootstrapping and confidence-based unsupervised acoustic model training. A Spanish acoustic model was ported to Polish through the use of a manually constructed phoneme mapping. This initial model was refined through iterative recognition and retraining on the untranscribed audio data.
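
The confidence-filtered self-training loop can be caricatured in a few lines. The toy below re-estimates a deliberately mismatched two-class Gaussian "acoustic model" from unlabeled data, keeping only confidently labeled points at each iteration; real systems do this with HMMs and word-level confidences, not this simplification:

```python
# Self-contained toy of confidence-based unsupervised retraining: the
# mismatched initial means stand in for the ported cross-language model.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 1, 500)])
means = np.array([-0.5, 0.5])        # mismatched "ported" model

for it in range(5):
    like = norm.pdf(data[:, None], means, 1.0)     # per-class likelihoods
    post = like / like.sum(axis=1, keepdims=True)
    labels = post.argmax(axis=1)
    keep = post.max(axis=1) > 0.7                  # confidence filtering
    means = np.array([data[keep][labels[keep] == k].mean() for k in (0, 1)])
print("re-estimated means:", means.round(2))       # approach -2 and 2
```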

#11 Porting a European Portuguese broadcast news recognition system to Brazilian Portuguese

Authors: Alberto Abad ; Isabel Trancoso ; Nelson Neto ; M. Céu Viana

This paper reports on recent work, in the context of the PoSTPort project, aimed at porting a broadcast news recognition system originally developed for European Portuguese to other varieties. Concretely, this paper focuses on porting to Brazilian Portuguese. The impact of some of the main sources of variability has been assessed, and solutions are proposed at the lexical, acoustic and syntactic levels. The ported Brazilian Portuguese broadcast news system achieved a drastic performance improvement, from 56.6% WER (obtained with the European Portuguese system) to 25.5%.

#12 Modeling northern and southern varieties of Dutch for STT

Authors: Julien Despres ; Petr Fousek ; Jean-Luc Gauvain ; Sandrine Gay ; Yvan Josse ; Lori Lamel ; Abdel Messaoudi

This paper describes how the Northern (NL) and Southern (VL) varieties of Dutch are modeled in the joint Limsi-Vecsys Research speech-to-text transcription systems for broadcast news (BN) and conversational telephone speech (CTS). Using the Spoken Dutch Corpus resources (CGN), systems were developed and evaluated in the 2008 N-Best benchmark. Modeling techniques used in our systems for other languages were found to be effective for Dutch; however, it was also found to be important to have acoustic models, language models, and statistical pronunciation generation rules adapted to each variety. This was particularly true for the MLP features, which were only effective when trained separately for Dutch and Flemish. The joint submissions obtained the lowest WERs in the benchmark by a significant margin.

#13 Relative importance of formant and whole-spectral cues for vowel perception

Authors: Masashi Ito ; Keiji Ohara ; Akinori Ito ; Masafumi Yano

Three psycho-acoustical experiments were carried out to investigate the relative importance of formant frequency and whole spectral shape as cues for vowel perception. Four types of vowel-like signals were presented to eight listeners. The mean responses to stimuli including both the formant and amplitude-ratio features were quite similar to those for stimuli including only the formant peak feature. Nonetheless, reasonable vowel changes were observed in responses to stimuli including only the amplitude-ratio feature. Perceived vowel changes were also observed even for stimuli including neither of these features. The results suggest that perceptual cues are involved in various parts of the vowel spectrum.

#14 Influences of vowel duration on speaker-size estimation and discrimination

Authors: Chihiro Takeshima ; Minoru Tsuzaki ; Toshio Irino

Several experimental studies have shown that the human auditory system has a mechanism for extracting speaker-size information, given sufficiently long sounds. This paper investigated the influence of vowel duration on size extraction using short vowels. In a size estimation experiment, listeners subjectively estimated the size (height) of the speaker for isolated vowels. The results showed that listeners' perception of speaker size was highly correlated with vocal-tract length at all tested durations (from 16 ms to 256 ms). In a size discrimination experiment, listeners were presented with two vowels with scaled vocal-tract lengths and were asked which vowel was perceived to be spoken by the smaller speaker. The results showed that the just-noticeable differences (JNDs) in speaker size were almost the same for durations longer than 32 ms. However, the JNDs rose considerably for the 16-ms duration. These observations suggest that the auditory system can extract speaker-size information even from 16-ms vowels, although the precision of size extraction deteriorates when the duration falls below 32 ms.

#15 High front vowels in Czech: a contrast in quantity or quality?

Authors: Václav Jonáš Podlipský ; Radek Skarnitzl ; Jan Volín

We investigate the perception and production of Czech /I/ and /i:/, a contrast traditionally described as quantitative. First, we show that the spectral difference between the vowels is for many Czechs as strong a cue as (or even stronger than) duration. Second, we test the hypothesis that this shift towards vowel quality as a perceptual cue for this contrast resulted in weakening of the durational differentiation in production. Our measurements confirm this: members of the /I/-/i:/ pair differed in duration much less than those of other short-long pairs. We interpret these findings in terms of Lindblom's H&H theory.

#16 Effect of contralateral noise on energetic and informational masking on speech-in-speech intelligibility

Authors: Marjorie Dole ; Michel Hoen ; Fanny Meunier

This experiment tested the advantage of binaural presentation of an interfering noise in a task involving identification of monaurally presented words. These words were embedded in three types of noise: a stationary noise, a speech-modulated noise, and a speech-babble noise, in order to assess the energetic and informational masking contributions to binaural unmasking. Our results showed substantial informational masking in the monaural condition, principally due to lexical and phonetic competition. We also found a binaural unmasking effect, which was larger when speech was used as the interferer, suggesting that this suppressive effect is more efficient in the case of high-level informational (lexical and phonetic) competition.

#17 Using location cues to track speaker changes from mobile, binaural microphones

Authors: Heidi Christensen ; Jon Barker

This paper presents initial developments towards computational hearing models that move beyond stationary microphone assumptions. We present a particle filtering based system that uses localisation cues to track speaker changes in meeting recordings. Recordings are made using in-ear binaural microphones worn by a listener whose head is constantly moving. Tracking speaker changes requires simultaneously inferring the perceiver's head orientation, as any change in relative spatial angle to a source can be caused by either the source moving or the microphones moving. In real applications, such as robotics, there may be access to external estimates of the perceiver's position. We investigate the effect of simulating varying degrees of measurement noise in an external perceiver position estimate. We show that only limited self-position knowledge is needed to greatly improve the reliability with which we can decode the acoustic localisation cues in the meeting scenario.
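
A minimal particle filter in this spirit is sketched below: it tracks a fixed source azimuth while fusing the binaural relative-angle cue with a noisy external estimate of head orientation. All motion and noise parameters are invented for illustration, not taken from the paper:

```python
# Toy particle filter: estimate a speaker's azimuth from relative-angle
# observations while the listener's head wanders.
import numpy as np

rng = np.random.default_rng(4)
T, N = 200, 500
true_src = np.deg2rad(40.0)
head = np.cumsum(rng.normal(0, 0.02, T))            # wandering head orientation
obs_rel = true_src - head + rng.normal(0, 0.05, T)  # binaural cue (relative angle)
obs_head = head + rng.normal(0, 0.10, T)            # noisy external self-position

particles = rng.uniform(-np.pi, np.pi, N)           # hypotheses for source azimuth
weights = np.ones(N) / N
for t in range(T):
    particles += rng.normal(0, 0.01, N)             # allow slow source motion
    pred_rel = particles - obs_head[t]              # predicted relative angle
    weights *= np.exp(-0.5 * ((obs_rel[t] - pred_rel) / 0.12) ** 2)
    weights /= weights.sum()
    if 1.0 / np.sum(weights**2) < N / 2:            # resample when degenerate
        idx = rng.choice(N, N, p=weights)
        particles, weights = particles[idx], np.ones(N) / N

print("estimated azimuth (deg):", np.rad2deg(np.sum(weights * particles)).round(1))
```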

#18 A perceptual investigation of speech transcription errors involving frequent near-homophones in French and American English

Authors: Ioana Vasilescu ; Martine Adda-Decker ; Lori Lamel ; Pierre Hallé

This article compares the errors made by automatic speech recognizers to those made by humans for near-homophones in American English and French. This exploratory study focuses on the impact of limited word context and the potential resulting ambiguities for automatic speech recognition (ASR) systems and human listeners. Perceptual experiments using 7-gram chunks centered on incorrect or correct words output by an ASR system show that humans make significantly more transcription errors on the first type of stimuli, thus highlighting the local ambiguity. The long-term aim of this study is to improve the modeling of such ambiguous items in order to reduce ASR errors.

#19 The role of glottal pulse rate and vocal tract length in the perception of speaker identity

Authors: Etienne Gaudrain ; Su Li ; Vin Shen Ban ; Roy D. Patterson

In natural speech, for a given speaker, vocal tract length (VTL) is effectively fixed whereas glottal pulse rate (GPR) is varied to indicate prosodic distinctions. This suggests that VTL will be a more reliable cue for identifying a speaker than GPR. It also suggests that listeners will accept larger changes in GPR before perceiving speaker change. We measured the effect of GPR and VTL on the perception of a speaker difference, and found that listeners hear different speakers given a VTL difference of 25%, but they require a GPR difference of 45%.

#20 Development of voicing categorization in deaf children with cochlear implant

Authors: Victoria Medina ; Willy Serniclaes

A cochlear implant (CI) improves hearing, but communication abilities still depend on several factors. The present study assesses the development of voicing categorization in deaf children with cochlear implants, examining both categorical perception (CP) and boundary precision (BP) performance. We compared 22 implanted children to 55 normal-hearing children using different age factors. The results showed that the development of voicing perception in CI children is fairly similar to that in normal-hearing controls with the same auditory experience, irrespective of differences in the age of implantation (two vs. three years of age).

#21 Processing liaison-initial words in native and non-native French: evidence from eye movements

Author: Annie Tremblay

French listeners have no difficulty recognizing liaison-initial words. This is in part because acoustic/phonetic information distinguishes liaison consonants from (non-resyllabified) word onsets in the speech signal. Using eye tracking, this study investigates whether native speakers of English, a language that does not have a phonological resyllabification process like liaison, can develop target-like segmentation procedures for recognizing liaison-initial words in French, and if so, how such procedures develop with increasing proficiency.

#22 Estimating the potential of signal and interlocutor-track information for language modeling

Authors: Nigel G. Ward ; Benjamin H. Walker

Although today most language models treat language purely as word sequences, there is recurring interest in tapping new sources of information, such as disfluencies, prosody, the interlocutor's dialog act, and the interlocutor's recent words. In order to estimate the potential value of such sources of information, we extend Shannon's guessing-game method for estimating entropy to work for spoken dialog. Four teams of two subjects each predicted the next word in a dialog using various amounts of context: one word, two words, all the words spoken so far, or the full dialog audio so far. The entropy benefit of the full-audio condition over the full-text condition was substantial, 0.64 bits per word, greater than the 0.54-bit benefit of full text context over trigrams. This suggests that language models may be improved by using the prosody of the speaker and context from the interlocutor.
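
For concreteness, Shannon's guessing-game method brackets the per-word entropy H from the distribution q over guess ranks: sum_i i*(q_i - q_{i+1})*log2(i) <= H <= -sum_i q_i*log2(q_i). The sketch below computes both bounds for a made-up guess-rank distribution:

```python
# Worked sketch of Shannon's guessing-game entropy bounds, where q[i] is
# the frequency with which subjects' (i+1)-th guess is correct.
import numpy as np

def entropy_bounds(q):
    q = np.asarray(q, dtype=float)
    q = q / q.sum()
    i = np.arange(1, len(q) + 1)
    upper = -np.sum(q[q > 0] * np.log2(q[q > 0]))
    q_next = np.append(q[1:], 0.0)
    lower = np.sum(i * (q - q_next) * np.log2(i))
    return lower, upper

# e.g. 55% first-guess correct, 20% second, 10% third, ...
print(entropy_bounds([0.55, 0.20, 0.10, 0.07, 0.05, 0.03]))
```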

#23 Effect of r-resonance information on intelligibility

Authors: Antje Heinrich ; Sarah Hawkins

We investigated the importance of phonetic information in preceding syllables for the intelligibility of minimal-pair words containing /r/ or /l/. Target words were cross-spliced into a different token of the same sentence (match) or into a sentence that was identical except that it originally contained the paired word (mismatch). Young and old adults heard the sentences, casually or carefully spoken, in cafeteria noise or 12-talker babble. Matched phonetic information in the syllable immediately before the target segment, and in earlier syllables, facilitated the intelligibility of r- but not l-words. Despite hearing loss, older adults also used this phonetic information.

#24 Perception of temporal cues at discourse boundaries

Authors: Hsin-Yi Lin ; Janice Fon

This study investigates the role of temporal cues in the perception of discourse boundaries. Target cues were penult lengthening, final lengthening, and pause duration. Results showed that different cues are weighted differently for different purposes: final lengthening is more important for detecting the presence of a boundary, while pause duration plays a larger role in cuing boundary size.

#25 Human audio-visual consonant recognition analyzed with three bimodal integration models

Authors: Zhanyu Ma ; Arne Leijon

Using audio-visual (A-V) recordings, ten normal-hearing people took consonant recognition tests at different signal-to-noise ratios (SNRs). The A-V recognition results are predicted by the fuzzy logical model of perception (FLMP) and the post-labelling integration model (POSTL). We also applied hidden Markov models (HMMs) and multi-stream HMMs (MSHMMs) to the recognition task. As expected, all the models agree qualitatively with the finding that the benefit gained from the visual signal is larger at lower acoustic SNRs. However, the FLMP severely overestimates the A-V integration result, while the POSTL model underestimates it. The visual automatic speech recognizer could be adjusted to correspond to human visual performance, and the MSHMMs combine the audio and visual streams efficiently, but the audio automatic speech recognizer must be further improved to allow precise quantitative comparisons with human audio-visual performance.
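
Of the three integration models, the FLMP has the simplest closed form: unimodal support values for each response category are multiplied and renormalized, P(r|A,V) = a_r * v_r / sum_k a_k * v_k. A minimal sketch with invented support values:

```python
# Minimal sketch of the fuzzy logical model of perception (FLMP).
import numpy as np

def flmp(audio_support, visual_support):
    fused = np.asarray(audio_support) * np.asarray(visual_support)
    return fused / fused.sum()

audio = [0.6, 0.3, 0.1]     # e.g. /ba/, /da/, /ga/ support from noisy audio
visual = [0.2, 0.7, 0.1]    # lipreading favours /da/
print(flmp(audio, visual))  # fused percept leans to /da/
```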