INTERSPEECH 2004

Total: 775

#1 From decoding-driven to detection-based paradigms for automatic speech recognition

Author: Chin-Hui Lee

We present a detection-based automatic speech recognition (ASR) paradigm that is capable of integrating both the knowledge sources accumulated in the speech science community and the modeling techniques established in the speech processing community. By exploring this new framework, we expect that researchers in the Interspeech community can collaboratively contribute to developing next generation algorithms that have the potential to surpass current capabilities, and go beyond the limitations of the state-of-the-art ASR technologies.

#2 In search of a universal phonetic alphabet - theory and application of an organic visible speech

Author: Hyun-Bok Lee

Phonetic symbols have an important role to play in phonetics, linguistics, language teaching, speech pathology and the speech sciences in general, and linguists and phoneticians have long tried to devise appropriate phonetic alphabets. Notable among them are Sweet, Bell, Jespersen, Pike, etc. The most successful and popular phonetic alphabet today is no doubt the International Phonetic Alphabet. The International Korean Phonetic Alphabet (IKPA for short) is a system of phonetic symbols that has been devised by the author on the basis of the articulatory phonetic (organic) principles exploited by the Korean King Sejong in creating the Korean alphabet of 28 letters in 1443. The Korean alphabet is not merely a phonetic alphabet of arbitrary nature but a highly sophisticated system consisting of sets of interrelated organic phonetic symbols, each set representing either the shape of the organs of speech, i.e. the lips, teeth, velum, etc., or their articulatory action. The Korean alphabet is, in a true sense of the word, a set of phonetic symbols designed to represent the organic visible speech of the human being. The author has applied the organic phonetic principles much more extensively and systematically in devising the IKPA than the King had done. Consequently the IKPA symbols are just as systematic, scientific, and easy to learn and memorize as the Korean alphabet, quite unlike their IPA counterparts which, having been derived mainly from Roman and Greek letters, are mostly unsystematic and arbitrary. The IKPA symbols visualize or mirror the actual speech organs or their action and thus tell us exactly what sort of articulatory action is involved in producing sounds. It is in this sense that the IKPA deserves to be called a "Universal Visible Speech", to be shared by all.

#3 From X-ray or MRI data to sounds through articulatory synthesis: towards an integrated view of the speech communication process

Author: Jacqueline Vaissière

This tutorial presents an integrated method to simulate the transfer from X-ray (or MRI) data to acoustics and finally to sounds. It illustrates the necessity of an articulatory model (here Maeda's model) so as to:
- construct realistic stimuli (sounds that human beings could really produce) for psychoacoustic experiments;
- "hear" what kind of sounds the vocal tract of a man or a woman, of a newborn or a monkey could produce and, inversely, what vocal tract shapes could produce a sound with given acoustic characteristics;
- study the correlation between observed subtle articulatory and acoustic differences and the choices of preferred prototypes in the realisation and perception of the same IPA symbol by native speakers of different languages;
- model vowels and consonants in context, distinguishing transitional gestures, which are necessary in the coarticulation process but not essential for differentiating phonemes;
- simulate the acoustic and perceptual consequences of the articulatory deformations produced by singers (e.g. the singing formant) or in pathological voices.

#4 Correlation between VOT and F0 in the perception of Korean stops and affricates

Author: Midam Kim

This research examines the trading relation between VOT and F0 in the production and perception of the three-way distinction of Korean stops and affricates, namely lenis, aspirated, and fortis, in word-initial position. For this research, I conducted production and perception tests. For the production test, two female and two male native speakers of Seoul Korean recorded a monosyllabic word list including /ka, kha, k*a, pa, pha, p*a, ta, tha, t*a, ca, cha, c*a/ 15 times in random order. On VOT-F0 planes, the results showed that lenis, aspirated, and fortis were discriminated by the two cues of VOT and F0, without overlapping. A MANOVA test showed a significant difference among lenis, aspirated, and fortis with respect to the correlated cues VOT and F0 (p<0.001). In the perception test, the stimuli were made by manipulating the sound files recorded in the production test in such a way that F0 values were raised or lowered at 10 Hz intervals while VOT values were held fixed. 14 subjects (seven females and seven males) participated in the perception test. The results showed that more than 94% of all fortis stimuli were not influenced by F0 changes, and that VOT and F0 values at the lenis-aspirated boundary showed a strong negative correlation (r = -0.923). From these results, I conclude that: 1) lenis, aspirated, and fortis Korean word-initial consonants are distinguished by the correlation of VOT and F0; 2) F0 does not function as an acoustic cue in the perception of Korean fortis; and 3) there is a phonetic trade-off between VOT and F0 in the distinction of Korean lenis and aspirated stops and affricates.

#5 The development of anticipatory labial coarticulation in French: a pioneering study

Authors: Aude Noiray ; Lucie Menard ; Marie-Agnes Cathiard ; Christian Abry ; Christophe Savariaux

This article reports an experimental study of labial anticipatory coarticulation, initiated in the framework of a description of motor control development. Four French children between 4 and 8 years old were audio-visually recorded uttering [iCny] puppet names, in which Cn corresponds to a varying number of intervocalic consonants. A kinetic lip area function was obtained via the ICP tracking system (Lallouache [1]) in order to describe the anticipatory movements of vocalic targets. Several concurrent models have been evaluated to account for anticipation in adults but very few for children, and since the most robust for French has proved to be the Movement Expansion Model (MEM; Abry & Lallouache [2], [3], [4], Abry et al. [5]), we adopted this framework for testing anticipatory motor behaviour in children.

#6 Speech recognition, syllabification and statistical phonetics

Author: Melvyn John Hunt

The classical approach in phonetics of careful observation of individual utterances can, this paper contends, be usefully augmented with automatic statistical analyses of large amounts of speech. Such analyses, using methods derived from speech recognition, are shown to quantify several known phonetic phenomena, most of which require syllable structure to be taken into account, and reveal some apparently new phenomena. Practical speech recognition normally ignores syllable structure. This paper presents quantitative evidence that prevocalic and postvocalic consonants behave differently. It points out some ways in which current speech recognition can be improved by taking syllable boundaries into account.

#7 Data-driven approaches for automatic detection of syllable boundaries

Author: Jilei Tian

Syllabification is an essential component of many speech and language processing systems. The development of automatic speech recognizers frequently requires working with subword units such as syllables. More importantly, syllabification is an indispensable part of any speech synthesis system. In this paper we present data-driven approaches to supervised learning and automatic detection of syllable boundaries. The generalization capability of the learning is investigated on the assignment of syllable boundaries to phoneme sequence representations in English. A rule-based self-correction algorithm is also proposed to automatically correct some syllabification errors. In a series of experiments, the neural network approach proved clearly better in terms of generalization performance and complexity.

#8 Phonemic repertoire and similarity within the vocabulary

Authors: Anne Cutler ; Dennis Norris ; Nuria Sebastian-Galles

Language-specific differences in the size and distribution of the phonemic repertoire can have implications for the task facing listeners in recognising spoken words. A language with more phonemes will allow shorter words and reduced embedding of short words within longer ones, decreasing the potential for spurious lexical competitors to be activated by speech signals. We demonstrate that this is the case via comparative analyses of the vocabularies of English and Spanish. A language which uses suprasegmental as well as segmental contrasts, however, can substantially reduce the extent of spurious embedding.

#9 Bootstrapping phonetic lexicons for new languages

Authors: Sameer Maskey ; Alan Black ; Laura Tomokiya

Although phonetic lexicons are critical for many speech applications, the process of building one for a new language can take a significant amount of time and effort. We present a bootstrapping algorithm to build phonetic lexicons for new languages. Our method relies on a large amount of unlabeled text, a small set of 'seed words' with their phonetic transcriptions, and the proficiency of a native speaker in correctly inspecting the generated pronunciations of the words. The method proceeds by automatically building Letter-to-Sound (LTS) rules from a small set of the most commonly occurring words in a large corpus of a given language. These LTS rules are retrained as new words are added to the lexicon in an Active Learning step. This procedure is repeated until we have a lexicon that can predict the pronunciation of any word in the target language with the desired accuracy. We tested our approach for three languages: English, German and Nepali.
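The loop described above (train LTS rules on the current lexicon, predict pronunciations for frequent unlabeled words, have a native speaker verify them, then retrain) can be sketched roughly as follows. This is an illustrative sketch only; the function names and the batch size are hypothetical placeholders, not the authors' actual implementation:

```python
def bootstrap_lexicon(corpus_words, seed_lexicon, train_lts, native_ok, target_size):
    """Iteratively grow a phonetic lexicon from a small seed.

    corpus_words: words sorted by descending corpus frequency
    seed_lexicon: dict word -> phoneme string (the 'seed words')
    train_lts:    builds letter-to-sound rules from a lexicon,
                  returning a word -> pronunciation function
    native_ok:    native-speaker check; returns the verified or
                  corrected pronunciation for a word
    """
    lexicon = dict(seed_lexicon)
    while len(lexicon) < target_size:
        lts = train_lts(lexicon)                 # retrain LTS rules on current lexicon
        # next batch of most frequent words not yet in the lexicon
        batch = [w for w in corpus_words if w not in lexicon][:100]
        if not batch:
            break                                # corpus exhausted
        for word in batch:
            guess = lts(word)                    # predicted pronunciation
            lexicon[word] = native_ok(word, guess)  # verified/corrected entry
    return lexicon
```

The native-speaker check is the Active Learning step: only predicted pronunciations are inspected, which is far cheaper than transcribing every word from scratch.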

#10 Lexical representation of non-native phonemes

Authors: Mirjam Broersma ; K. Marieke Kolkman

This study investigates whether the inaccurate processing of non-native phonemes leads to a not native-like representation of word forms containing these phonemes. Dutch and English listeners' processing of two English vowels and four plosives was studied in a phoneme monitoring experiment. The processing of difficult-to-identify non-native phonemes was compared to the processing of easy-to-identify ones. One of the vowels was difficult and the other easy to identify for Dutch listeners. The plosives were easy to identify in word-initial and word-medial position and difficult in word-final position for Dutch listeners. Lexical mediation was found to play a similar role for Dutch and English listeners, and there were no differences in the amount of lexical mediation for 'difficult' and 'easy' phonemes for Dutch listeners. This suggests that the inaccurate processing of non-native phonemes does not necessarily lead to a not native-like representation of word forms containing these phonemes.

#11 A comparative study on the production of inter-stress intervals of English speech by English native speakers and Korean speakers

Authors: Jong-Pyo Lee ; Tae-Yeoub Jang

This study compares the Inter-Stress Interval (ISI) patterns of English produced by native speakers of English and by Korean speakers. One consistent result of experiments on English speech rhythm has been that strict isochrony does not seem to exist, at least at the surface phonetic level. However, the marked difference found in the production experiment of the present study suggests that the distinction between language rhythms, especially between English and Korean, is apparent. While the English native speakers and the proficient Korean speakers of English consistently produced a rather small increase in ISI duration as the number of unstressed syllables between target stressed syllables increased, the non-proficient Korean speakers of English produced a somewhat larger one. The position of an ISI in a sentence does not, at present, seem to critically affect its duration.

#12 Articulatory correlates of voice qualities of good guys and bad guys in Japanese anime: an MRI study

Authors: Emi Zuiki Murano ; Mihoko Teshigawara

This paper examines the articulatory correlates of the Hero and Villain Voice Types, which were auditorily identified in a separate study on cartoon voices, using the MRI technique. The MRI images were in agreement with the previous auditory analysis results; the major difference between articulatory postures of heroes and those of villains and between two villainous voice types was found in the supraglottal states and the pharyngeal cavity. Auditory analysis can be as valid as any other analysis method depending on the level of training in a commonly accepted system such as Laver's framework for voice quality description. However, the MRI technique also allowed us to see what would not be observed otherwise, e.g., larynx height, pharyngeal cavity, vocal tract length, and the position of the hyoid bone. Auditory and physiological methods should be used in combination in order to further our understanding of the larynx and the pharynx.

#13 Effects of phonetic contexts on the duration of phonetic segments in fluent read speech

Author: Sorin Dusan

Coarticulation is an important phenomenon that affects the realization of phonetic segments. The effects of coarticulation are prominent in both spectral and temporal domains. Various durational effects of phonetic contexts on the adjacent phonetic segments have been previously reported based on individual distinctive features (e.g., voiced stops lengthen and unvoiced stops shorten the preceding vowels) or specific contexts (e.g., both /s/ and /p/ are shorter in a /sp/ cluster). This paper presents a comprehensive method for analyzing the phonetic context effects of all phonetic segments on the duration of their preceding or succeeding adjacent phonetic segments in fluent read speech, using the TIMIT American English corpus. Statistical methods are employed to analyze the variations in mean durations of all phonetic segments as functions of preceding or succeeding phonetic identities. 99% confidence intervals for the mean durations are also presented to reveal which pairs of phonetic contexts present statistically significant differences.
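A standard way to obtain such intervals for large per-cell counts (as in TIMIT) is a normal-approximation confidence interval on the mean duration of each segment-in-context cell. The sketch below assumes this conventional construction; the paper's exact statistical procedure is not specified here:

```python
import math
from statistics import mean, stdev

def ci99(durations):
    """99% confidence interval for the mean of one segment-in-context
    duration cell, using the normal approximation (z = 2.576).

    durations: list of observed durations (e.g. in ms) for one phonetic
    segment in one left- or right-context condition.
    """
    n = len(durations)
    m = mean(durations)
    half = 2.576 * stdev(durations) / math.sqrt(n)  # z * standard error
    return m - half, m + half
```

Two contexts whose intervals do not overlap can then be flagged as showing a statistically significant durational difference, which is the comparison the paper reports across all context pairs.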

#14 A study on nasal coda loss in continuous speech

Author: Qiang Fang

In this study, statistical analysis is used to investigate nasal coda loss in spoken Standard Chinese. To identify the factors that influence nasal coda loss, we take into account segmental and supra-segmental features and their interactions. We find that articulation manner, the post-nasal-coda boundary, and tone significantly influence nasal coda loss; the interactions between tone and stress, tone and post boundary, and articulation manner and tone are significant as well.

#15 An improved pair-wise variability index for comparing the timing characteristics of speech

Author: Hua-Li Jian

The pair-wise variability index has become a useful and widely used tool for comparing the syllable timing of speech. In this paper we present an improved pair-wise variability index based on the median instead of the mean, which can more strongly amplify and reveal the differences in the timing characteristics of two datasets. Further, it places less stringent requirements on the pre-processing of the measurements and is therefore more robust to outliers in the dataset. The effectiveness of the new measure is demonstrated through an example in which the measure is applied to data from American English speech and Taiwan English speech. The results obtained with the improved pair-wise variability index are compared to those of the standard pair-wise variability index and the rhythm ratio.

#16 An acoustic study of speech rhythm in Taiwan English

Author: Hua-Li Jian

American English and Taiwan English have been found to exhibit different rhythmic patterns. American English is often described as a stress-timed language, while it has been suggested that Taiwan English is a syllable-timed language. This paper addresses the differences between these two varieties of English through an acoustic study. In particular the F1/F2 formant space is investigated. The results show that reduced vowels in American English are more concentrated in the F1/F2 formant space than reduced vowels in Taiwan English, which are more dispersed.

#17 Language specific phonetic rules: evidence from domain-initial strengthening

Author: Sung-A Kim

This paper investigates domain-initial strengthening in English and Hamkyeong Korean. Although many languages are known to display domain-initial strengthening (Byrd 2000, Dilley, Shattuck-Hufnagel & Ostendorf 1996), it is as yet unclear whether initial-syllable vowels preceded by consonants undergo it as well. This study presents the results of an experimental study of initial syllables in English and Hamkyeong Korean. Durations of initial-syllable vowels were compared to those of second vowels in real-word tokens for both languages. Hamkyeong Korean, like English, turned out to strengthen domain-initial consonants. With regard to vowel durations, we found no significant prosodic effect in English. On the other hand, Hamkyeong Korean showed significant differences between the durations of initial and non-initial vowels in the higher prosodic domains. The findings are theoretically important as they reveal that the potentially universal phenomenon of initial strengthening is subject to language-specific variation in its implementation.

#18 Spectral characteristics of the release bursts in Korean alveolar stops

Author: Hansang Park

This study investigates spectral characteristics of the release bursts in Korean alveolar stops in terms of intensity, center of gravity, and skewness of the spectra of the release burst across phonation types and speakers. The results showed that there was no significant difference in intensity, center of gravity, or skewness across phonation types but a significant difference across speakers. This means that difference in phonation type does not lead to any significant difference in the spectra of the release burst. This study suggests that difference in the spectral characteristics of the release burst across phonation types can be ignored in speech synthesis or speech recognition.

#19 Frequency effects on vowel reduction in three typologically different languages (Dutch, Finnish, Russian)

Authors: Rob Van Son ; Olga Bolotova ; Louis C. W. Pols ; Mietta Lennes

As a result of the cooperation in the Intas 915 project, annotated speech corpora have become available in three different languages for both read and spontaneous speech of some 4-5 male and 4-5 female speakers per language (6-10 minutes per speaker). These data have been used to study the effects of redundancy on acoustic vowel reduction, in terms of vowel duration, F1-F2 distance to a virtual target of reduction, spectral center of gravity, and vowel intensity. It was shown that in all three (typologically different) languages vowel redundancy increases acoustic reduction in the same way. The reduction of redundant vowels seems to be a language universal.

#20 Assessment of non-native phones in anglicisms by German listeners

Authors: Julia Abresch ; Stefan Breuer

By means of a pair comparison test, the preferences of German native speakers for English or German sounds in spoken anglicisms were investigated. The collected data can serve as a reference for deciding which of the English xenophones have to be integrated into a German TTS system to allow for an appropriate pronunciation of anglicisms in German.

#21 Phonology of exceptions for Korean grapheme-to-phoneme conversion

Author: Sunhee Kim

As an essential part of Korean speech recognition and Text-To-Speech (TTS) systems, a Korean grapheme-to-phoneme conversion system is generally composed of a set of regular rules and an exceptions dictionary [1, 2, 3]. While research on the regular rules has progressed actively, the exceptions have been recorded in the dictionary in a simple and unsystematic manner. This paper presents a systematic description of the exceptions for a grapheme-to-phoneme conversion system, based on an analysis of the entries of a lexical dictionary [4] from a phonological point of view, showing that the exceptions are related to certain limited phonological phenomena.

#22 Acoustic and prosodic analysis of Japanese vowel-vowel hiatus with laryngeal effect

Authors: Kitazawa Shigeyoshi ; Shinya Kiriyama

We investigated V-V hiatus through J-ToBI labeling and listening to whole phrases, in order to estimate the degree of discontinuity and, where possible, to determine the exact boundary between two phrases. In most cases an appropriate boundary was found at the maximum perceptual score. Using electroglottography (EGG) and spectrograms, the acoustic-phonological features of these V-V hiatuses were found to be phrase-initial glottalization and phrase-final nasalization, observable in the EGG and spectrogram, as well as phrase-final lengthening and phrase-initial shortening. The test materials are taken from the "Japanese MULTEXT", consisting of particle - vowel (36), adjective - vowel (5), and word - word (4) sequences.

#23 A cross-linguistic acoustic comparison of unreleased word-final stops: Korean and Thai

Author: Kimiko Tsukada

This study compared acoustic characteristics of final stops in Korean and Thai. Word-final stops are phonetically realized as unreleased stops in these languages. Native speakers of Korean and Thai produced monosyllabic words ending with [p t k] in each of their native languages. Formant frequencies of /i a u/ at the vowel's offset were examined. In both languages, the place effect was significant and interacted with the vowel type. For non-front vowels (/a/ and /u/), F2 offset was highest before [t], while for the front vowel (/i/), it was highest before [k]. Preliminary results of a perception experiment with English-speaking listeners suggest that the absence of release bursts is most detrimental to the intelligibility of [k], least for [p] and intermediate for [t].

#24 Acoustic correlates of phrase-internal lexical boundaries in Dutch

Authors: Taehong Cho ; Elizabeth K. Johnson

The aim of this study was to determine if Dutch speakers reliably signal phrase-internal lexical boundaries, and if so, how. Six speakers recorded 4 pairs of phonemically identical strong-weak-strong (SWS) strings with matching syllable boundaries but mismatching intended word boundaries (e.g. 'reis # pastei' versus 'reispas # tij', or more broadly C1V1(C)#C2V2(C)C3V3(C) vs. C1V1(C)C2V2(C)#C3V3(C)). An Analysis of Variance revealed 3 acoustic parameters that were significantly greater in the S#WS items (C2 DURATION, RIME1 DURATION, C3 BURST AMPLITUDE) and 5 parameters that were significantly greater in the SW#S items (C2 VOT, C3 DURATION, RIME2 DURATION, RIME3 DURATION, and V2 AMPLITUDE). Additionally, center of gravity measurements suggested that the [s] to [t] coarticulation was greater in 'reis # pa[st]ei' than in 'reispa[s] # [t]ij'. Finally, a Logistic Regression Analysis revealed that 3 parameters (RIME1 DURATION, RIME2 DURATION, and C3 DURATION) contributed most reliably to an S#WS versus SW#S classification.

#25 Phonotactics vs. phonetic cues in native and non-native listening: Dutch and Korean listeners' perception of Dutch and English

Authors: Taehong Cho ; James M. McQueen

We investigated how listeners of two unrelated languages, Dutch and Korean, process phonotactically legitimate and illegitimate sounds spoken in Dutch and American English. To Dutch listeners, unreleased word-final stops are phonotactically illegal because word-final stops in Dutch are generally released in isolation, but to Korean listeners, released final stops are illegal because word-final stops are never released in Korean. Two phoneme monitoring experiments showed a phonotactic effect: Dutch listeners detected released stops more rapidly than unreleased stops whereas the reverse was true for Korean listeners. Korean listeners with English stimuli detected released stops more accurately than unreleased stops, however, suggesting that acoustic-phonetic cues associated with released stops improve detection accuracy. We propose that in non-native speech perception, phonotactic legitimacy in the native language speeds up phoneme recognition, the richness of acoustic-phonetic cues improves listening accuracy, and familiarity with the non-native language modulates the relative influence of these two factors.