INTERSPEECH.2005 - Language and Multimodal

Total: 202

#1 The effect of stress and boundaries on segmental duration in a corpus of authentic speech (British English)

Authors: Daniel Hirst, Caroline Bouzon

Research into the effect of stress and boundaries on segmental duration in speech has, for obvious reasons, most often been applied to carefully constructed sentences pronounced in laboratory conditions. The availability of a large labelled database of British English (Aix-Marsec) provides an opportunity to test different hypotheses concerning the factors influencing segmental duration from a corpus of authentic speech (defined as speech produced with the intent of communicating its meaning to the listener). In particular, in this paper, we look at the effect of stress and boundaries on prosodic structure in British English. Recent work has suggested that while word boundaries seem definitely to have a significant effect on the duration of segments, once the number of segments in the narrow rhythm unit is known, there is no orthogonal effect of word stress. In this study we look in particular at effects of word and intonation unit boundaries and at their possible interaction with stress and find that while intonation unit boundaries definitely affect segmental duration, no similar effect could be shown for word boundaries.


#2 Investigation of the relationship between turn-taking and prosodic features in spontaneous dialogue

Authors: Tomoko Ohsuga, Masafumi Nishida, Yasuo Horiuchi, Akira Ichikawa

In this study, we investigated the relationship between turn-taking and prosody. We considered that to interact smoothly in real-time communication, speakers must show presignals to turn-taking as prosodic features before turn edges. We attempted to discriminate the turn change by the decision tree method using only prosodic features in turn-final accentual phrases that include earlier positions compared with turn-final mora. In the discrimination experiment, we used the corpus of Japanese spontaneous dialogue, and defined prosodic parameters such as F0 contour, power contour and duration. We compared two parameter conditions: using parameters with and without the final mora of turns. From the results, the accuracy under the condition of not using the parameters of the final mora is 80%, which is not significantly worse than the result of 83% when using all parameters. Taking into account that only prosody was used, we consider this result to be reasonably good.


#3 Filled pauses as cues to the complexity of following phrases

Authors: Michiko Watanabe, Keikichi Hirose, Yasuharu Den, Nobuaki Minematsu

Corpus-based studies of spontaneous speech showed that filled pauses tended to precede relatively long and complex constituents. We examined whether listeners made use of such a tendency in speech processing. We tested the hypothesis that when listeners heard filled pauses they tended to expect a relatively long and complex phrase to follow. In the experiment participants listened to sentences referring to both simple and compound shapes presented on a computer screen. Their task was to press a button as soon as they had identified the shape that they heard. The sentences involved two factors: complexity and fluency. As the complexity factor, half of the sentences described compound shapes with long and complex phrases and the other half described simple shapes with short and simple phrases. As the fluency factor, phrases describing a shape had a preceding filled pause, a preceding silent pause of the same length as the filled pause, or no preceding pause. The results showed that response times for the complex phrases were significantly shorter after filled or silent pauses than when there was no pause. In contrast, there was no significant difference between the three conditions for the simple phrases. The results support the hypothesis and indicate that it is the duration of filled pauses that gives listeners cues to the complexity of upcoming phrases.


#4 Perceptual magnet effect in German boundary tones

Authors: Katrin Schneider, Bernd Möbius

The experiment described in this paper tests for the perceptual magnet effect within the categories of high and low boundary tones in German, referring to question and statement, respectively. The experiment is based on previous work in which the categorical status of the two German boundary tones had been evaluated. The results found there showed that there was a discrimination ability within categories which could not be explained by the classical definition of categorical perception. The results reported in the present paper show that a perceptual magnet exists in the statement category but not in the question category.


#5 Constraints on the acquisition of simplex and complex words in German

Authors: Angela Grimm, Jochen Trommer

It is a common assumption that prosodic restrictions on the shape of children's early productions refer to the prosodic word (cf. [1]). However, empirical research on word structure has focused almost exclusively on simplex words where the morphosyntactic and prosodic word boundaries coincide ([2], [3], [4], [5]). In this paper, we provide new evidence from the acquisition of German complex words (compounds and particle verbs) showing that the restriction to a single foot indeed holds for the prosodic word, not for the morphosyntactic word. Thus, our results corroborate the crucial function of the prosodic word in language development.


#6 Whistled speech: a natural phonetic description of languages adapted to human perception and to the acoustical environment

Author: Julien Meyer

The scientific study of the whistled speech of several languages has already provided an alternative point of view on many aspects of language. After a general overview of the phenomenon, this paper develops a comparative analysis of several whistled forms of non-tonal languages which are still in use. Meanwhile, the vocalic and consonantal reductions observed in this type of whistled speech are detailed thanks to a typological approach. It sheds new light on the main aspects of the encoding strategy thanks to results of acoustic propagation and perceptive tests. Actually, whistled languages naturally take advantage of a narrow band of frequencies to focus on key elements of the phonology. They carry an essential part of the linguistic information that the listeners are able to recognize if they have overcome a long period of learning. Therefore, they can be seen as phonetic descriptions of local languages. Such properties are enabled by whistles which are remarkably adapted to the perceptive capacities of human beings and to the natural acoustic environment.


#7 The stress foot as a unit of planned timing: evidence from shortening in the prosodic phrase

Authors: Heejin Kim, Jennifer Cole

This study investigates whether the stress foot is a planned timing unit in American English, by examining the durational characteristics of the foot in three different prosodic contexts - i) within an intermediate phrase, ii) across an intermediate phrase and iii) across an intonational phrase. The results show that as the number of syllables in a foot increases, the duration of the foot increases, but the mean duration of syllables is reduced. Our examination of the internal structure of the foot reveals that there is a consistent shortening of stressed syllables within an intermediate phrase. These findings indicate that the stress foot within the intermediate phrase is a timing unit where durational shortening occurs in compensation for an increase in syllable count within the foot.


#8 Segmental "anchorage" and the French late rise

Authors: Pauline Welby, Hélène Loevenbruck

We examined the tonal alignment and scaling patterns of the start and end points of the French late rise, using a rate manipulation paradigm. Our findings call into question aspects of the segmental anchoring hypothesis: the low starting point of the late rise was not stably anchored to a segmental landmark, and for some speakers, F0 excursion size varied across rates. The position of the peak of the late rise was found to vary across syllable structures. To account for the observed patterns, we propose the notion of an "anchorage," that is, a region within which an intonational turning point can anchor. For the peak of the French late rise, this anchorage stretches from just before the end of the vowel of the last full syllable of the accentual phrase to the end of the phrase.


#9 Prosodic cues for syntactically-motivated junctures

Author: Ivan Chow

A pilot study was conducted to examine the manner in which Cantonese speakers use prosody to mark syntactic junctures in speech production. Test sentences were designed to create the intended experimental conditions - each sentence-pair consists of an identical array of morphemes with exactly two interpretations according to two different syntactic structures. These structures can be clarified by conveying prosodic boundaries that coincide with junctures between major syntactic constituents. The purpose of the experiment as well as the meaning of the test sentences (with respect to their syntactic structures) were explained to the subjects. After a few practice runs, they were asked to read these sentences aloud, conveying the marked boundaries, while their voices were being recorded. Three prosodic juncture markers (pauses, pitch reset and pre-boundary lengthening) were under examination. Results of the acoustic and statistical analyses of the recorded signals indicated that pauses were the most effective way of marking junctures, followed by pitch reset. Pre-boundary lengthening was found to be infrequent; in the rare cases where it was detected, pre-boundary syllables were only slightly longer than their non-boundary counterparts. Nonetheless, vast individual differences in terms of the amplitudes and frequencies of prosodic juncture markers were observed. The present study provides acoustical data regarding the manner in which Cantonese speakers use prosody in utterance structure clarification, vis-à-vis their specific language experience. In a large-scale experiment, test sentences will be embedded in contextual paragraphs that semantically and prosodically prompt readers to convey the intended prosodic structure. This experiment is underway and is expected to yield more conclusive results.


#10 A glimpse of the time-course of intonation processing in European Portuguese

Authors: Isabel Falé, Isabel Hub Faria

We have investigated the phenomenon of prediction in speech processing through intonational contrasts in European Portuguese (EP) grammar.


#11 Great expectations - introspective vs. perceptual prominence ratings and their acoustic correlates

Author: Petra Wagner

In order to gain knowledge about the interaction between top-down expectations of listeners concerning prosodic prominence and its acoustic correlates, two exploratory empirical studies were carried out. First, native and non-native subjects rated prominences of speech read at normal and very fast (prosodically very different) rates. Later, these ratings were compared with introspective prominence ratings of different listeners. First results indicate a major influence of introspection on prominence ratings, especially if acoustic cues are difficult to interpret, as is the case in very fast speech. Compared to native subjects, non-natives rely less on their introspection and more on the acoustics.


#12 Choosing a scale for measuring perceived prominence

Authors: Christian Jensen, John Tøndering

Three different scales which have been used to measure perceived prominence are evaluated in a perceptual experiment. Average scores of raters using a multi-level (31-point) scale, a simple binary (2-point) scale and an intermediate 4-point scale are almost identical. The potentially finer gradation possible with the multi-level scale(s) is compensated for by having multiple listeners, which is also a requirement for obtaining reliable data. In other words, a high number of levels is neither a sufficient nor a necessary requirement. Overall the best results were obtained using the 4-point scale, and there seems to be little justification for using a 31-point scale.


#13 The effects of prosodic features on the interpretation of clarification ellipses

Authors: Jens Edlund, David House, Gabriel Skantze

In this paper, the effects of prosodic features on the interpretation of elliptical clarification requests in dialogue are studied. An experiment is presented where subjects were asked to listen to short human-computer dialogue fragments in Swedish, where a synthetic voice was making an elliptical clarification after a user turn. The prosodic features of the synthetic voice were systematically varied, and the subjects were asked to judge what was actually intended by the computer. The results show that an early low F0 peak signals acceptance, that a late high peak is perceived as a request for clarification of what was said, and that a mid high peak is perceived as a request for clarification of the meaning of what was said. The study can be seen as the beginnings of a tentative model for intonation of clarification ellipses in Swedish, which can be implemented and tested in spoken dialogue systems.


#14 Exploration of different types of intonational deviations in foreign-accented and synthesized speech

Author: Matthias Jilka

The study provides an analysis of the basic manifestations of intonational deviations in foreign-accented (American English accent in German) and synthesized speech. It takes into account the crucial influence of the used model of intonation description and makes a major distinction between individual deviations that cause the impression of foreignness or unnaturalness immediately when they occur, and others that do so only when an accumulation of several such deviations does not allow for a meaningful interpretation anymore. It is argued that this is due to the high variability allowed in prosodic contexts. A closer description of the first group of deviations includes the transfer of categories and of the phonetic realizations of categories as well as a discussion of seemingly unmotivated errors and the most likely causes of intonation errors in synthesized speech. Finally, it is shown that in the case of foreign accent the language-specific manifestations of the presented deviations combine to create a characteristic overall impression of foreignness that is recognizable independently of the segmental content of an utterance.


#15 A rhythmic-prosodic model of poetic speech

Author: Jörg Bröggelwirth

In this paper a new approach towards the analysis of speech rhythm is presented. In the speech rhythm literature it has often been argued that rhythmic phenomena are more transparent in the metrical structure of orally produced poetry. However, up to now only a few phoneticians have worked on this special speaking style. For analyzing the rhythmic and prosodic patterns of this kind of speech, a corpus of read German poetry, including four different meters, was recorded. This study gives a first look at durational and intonational effects in the data. A final prosodic model and its perceptual evaluation are currently under development.


#16 Fine-tuning speech registers: a comparison of the prosodic features of child-directed and foreigner-directed speech

Authors: Sonja Biersack, Vera Kempe, Lorna Knapton

The present study compares prosodic features of child-directed speech (CDS) and foreigner-directed speech (FDS), to examine whether FDS is a derivative of CDS as suggested in sociolinguistic studies. Twelve female speakers completed a simple referential communication task addressed to an imaginary adult, an imaginary foreigner, and an imaginary child.


#17 An analysis of the intonational structure of stuttered speech

Author: Timothy Arbisi-Kelm

While previous studies have successfully revealed areas vulnerable to disfluency at the word level in stuttering, identifying the specific factors responsible for this instability has proved difficult. Analyzing the effects of phrasal prosody, which governs such word-level factors as lexical stress [1], is critical in order to account for the relations between word-level and phrase-level effects, and how they affect patterns of disfluency in stuttered speech. In a story-telling task performed by two stutterers and two control subjects, it was found that stutterers' disfluencies were accompanied by more prosodic irregularities prior to the actual cause of the disfluency. In particular, changes in f0 and duration affect the realization of cues in the disfluent environment, resulting in fundamental alterations of intonational phrase structure. Anticipatory and target-realized disfluencies contribute different acoustic cues to their immediate environments, the results of which often create conflicting prominence relationships among prosodic constituents, thereby losing information key for conveying meaning.


#18 Voice quality dimensions of pitch accents

Authors: Britta Lintfert, Wolfgang Wokurek

Acoustic and electroglottographic (EGG) measurements were used to examine voice quality parameters during the production of the rising and falling pitch movements in German. The vowels /a:/ and /E/ were studied in a single-speaker speech corpus. The acoustic measurements comprised an automatic spectral analysis of the glottal parameters open quotient (OQ), glottal opening (GO), skewness of glottal pulse (SK), rate of closure (RC), amplitude of voicing (AV) and completeness of closure (CC). OQ and AV seem to be the only acoustic parameters influenced by pitch accent and not by word stress. From the electroglottographic measurements only open quotient parameters (OQI and OQII), two parameters of closing phase (SCV and SCA) and two parameters of opening phase (EOV and EOT) showed a significant difference as a function of pitch accent type.


#19 Audiovisual production and perception of contrastive focus in French: a multispeaker study

Authors: Marion Dohen, Hélène Loevenbruck

This study examines the visual cues to prosodic contrastive focus in Hexagonal French and their role in visual speech perception. Two audiovisual corpora were recorded (from two male native speakers of French) consisting of sentences with a subject-verb-object (SVO) syntactic structure. Four conditions were studied: focus on each phrase (S,V,O) and broad focus. The corpora were first acoustically validated. Then lip area and jaw opening were extracted from the video. For each speaker, we identified a set of visible correlates of contrastive focus. The combined results showed that there were consistent visible articulatory correlates of contrastive focus across speakers: a) an increase in lip area and its first derivative on the focused item b) a lengthening of the focal syllables. There were also speaker-specific strategies in the amount of a) pre-focal anticipation or b) post-focal hypo-articulation.


#20 Predicting end of utterance in multimodal and unimodal conditions

Authors: Pashiera Barkhuysen, Emiel Krahmer, Marc Swerts

In this paper, we describe a series of perception studies on uni- and multimodal cues to end of utterance. Stimuli were fragments taken from a recorded interview session, consisting of the parts in which speakers provided answers. The answers varied in length and were presented without the preceding question of the interviewer. The subjects had to predict when the speaker would finish his turn, based on video material and/or auditory material. The experiment consisted of 3 conditions: in one condition, the stimuli were presented as they were recorded (both audio and vision), in the two remaining conditions stimuli were presented in only the auditory or the visual channel. Results show that the audiovisual condition evoked the fastest reaction times and the visual condition the slowest. Arguably, the combination of cues from different modalities functions as complementary sources and might thus improve prediction.


#21 Production of prominence in Japanese sign language

Authors: Saori Tanaka, Masafumi Nishida, Yasuo Horiuchi, Akira Ichikawa

In sign language research, it has been technically possible to investigate prominence around a unit of sign movements that creates a strong visual impression. Based on research on speech prominence, this study proposes techniques to delimit a sequential hand movement into small units, and investigates prominence by comparing the physical properties of each unit between emphasized signing and non-emphasized signing in Japanese Sign Language. The original data for this paper came from 3 native signers who produced 3 sentences, 5 times in 2 modes. The result of Factor Analysis showed that prominence on the lexical parts of the sign movements was most distinctive across all subjects and examples. Variation in the transitional part of the sign movements and longer pause insertions before the lexical part of the sign movements were also observed.


#22 Fast vocabulary-independent audio search using path-based graph indexing

Authors: Olivier Siohan, Michiel Bacchiani

Classical audio retrieval techniques consist in transcribing audio documents using a large vocabulary speech recognition system and indexing the resulting transcripts. However, queries that are not part of the recognizer's vocabulary or have a large probability of getting misrecognized can significantly impair the performance of the retrieval system. Instead, we propose a fast vocabulary independent audio search approach that operates on phonetic lattices and is suitable for any query. However, indexing phonetic lattices so that any arbitrary phone sequence query can be processed efficiently is a challenge, as the choice of the indexing unit is unclear. We propose an inverted index structure on lattices that uses paths as indexing features. The approach is inspired by a general graph indexing method that defines an automatic procedure to select a small number of paths as indexing features, keeping the index size small while allowing fast retrieval of the lattices matching a given query. The effectiveness of the proposed approach is illustrated on broadcast news and Switchboard databases.
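The idea of an inverted index keyed on paths can be illustrated with a much simplified sketch. It assumes each document is a single linear phone sequence rather than a lattice, and it uses every fixed-length phone n-gram as a path instead of the automatic path selection mentioned in the abstract. All function names here are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def phone_paths(phones, n):
    """Return all contiguous length-n phone subsequences (paths) of a sequence."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]

def contains(phones, query):
    """Exact check: does the phone list contain the query as a contiguous run?"""
    m = len(query)
    return any(phones[i:i + m] == query for i in range(len(phones) - m + 1))

def build_index(documents, n=2):
    """Map each length-n path to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, phones in documents.items():
        for path in phone_paths(phones, n):
            index[path].add(doc_id)
    return index

def search(index, documents, query, n=2):
    """Intersect the posting lists of the query's paths, then verify candidates."""
    paths = phone_paths(query, n)
    if not paths:
        # Query shorter than n: no path to look up, so scan every document.
        candidates = set(documents)
    else:
        candidates = set.intersection(*(index.get(p, set()) for p in paths))
    # The index only filters; the exact match is confirmed on the survivors.
    return sorted(d for d in candidates if contains(documents[d], query))

# Toy phone sequences (assumed data, for illustration only).
docs = {
    "a": ["k", "ae", "t", "s"],
    "b": ["d", "ao", "g"],
    "c": ["s", "k", "ae", "t"],
}
index = build_index(docs)
matches = search(index, docs, ["k", "ae", "t"])
```

The index acts as a cheap filter: only documents sharing every path with the query are checked exactly, which is what keeps retrieval fast when the index is small.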


#23 The effects of speech recognition and punctuation on information extraction performance

Authors: John Makhoul, Alex Baron, Ivan Bulyko, Long Nguyen, Lance Ramshaw, David Stallard, Richard Schwartz, Bing Xiang

We report on experiments to measure the effect of speech recognition errors and automatic punctuation insertion errors on the performance of information extraction (entity and relation extraction). The outputs of several recognition systems with a range of word error rates (WER), along with punctuation insertion, were fed into a system that extracts entities and relations from the recognized text. Entity and relation value scores were measured as a function of WER and types of punctuation used. The results of the experiments showed that both entity and relation value scores degrade linearly with increasing WER, with a relative reduction in scores of about twice the WER. The information extraction modules require the inclusion of sentence boundaries, at a minimum; however, the experiments showed that the exact locations of these boundaries are not important for entity and relation extraction. In contrast, when comparing the effects of full punctuation to just automatic sentence boundary insertion, there was a loss in entity value scores of 13.5% and in relation value scores of 25%. Further, commas play a significantly greater role in entity and relation extraction than other types of punctuation.
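The reported linear relationship (a relative score reduction of about twice the WER) can be written as a one-line rule of thumb. The sketch below is only an arithmetic illustration of that statement: the function name, the clamping at zero, and the example score of 0.60 are assumptions, not values from the paper.

```python
def estimated_score(clean_score, wer, slope=2.0):
    """Estimate an extraction score at a given WER, assuming the relative
    loss grows linearly as slope * WER (slope of about 2 per the abstract)."""
    relative_loss = min(1.0, slope * wer)  # clamp so the score never goes negative
    return clean_score * (1.0 - relative_loss)

# Hypothetical example: a score of 0.60 on clean text, at 10% WER.
example = estimated_score(0.60, 0.10)
```

Under these assumptions, a 10% WER corresponds to a 20% relative loss, so a hypothetical clean-text score of 0.60 would drop to roughly 0.48.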


#24 Indexing uncertainty for spoken document search

Authors: Ciprian Chelba, Alex Acero

The paper presents the Position Specific Posterior Lattice, a novel lossy representation of automatic speech recognition lattices that naturally lends itself to efficient indexing and subsequent relevance ranking of spoken documents. Albeit lossy, the PSPL lattice is much more compact than the ASR 3-gram lattice from which it is computed, at virtually no degradation in word-error-rate performance. Since new paths are introduced in the lattice, the "oracle" accuracy increases over the original ASR lattice. In experiments performed on a collection of lecture recordings - MIT iCampus database - the spoken document ranking accuracy was improved by 20% relative over the commonly used baseline of indexing the 1-best output from an automatic speech recognizer. The Mean Average Precision (MAP) increased from 0.53 when using 1-best output to 0.62 when using the new lattice representation. The reference used for evaluation is the output of a standard retrieval engine working on the manual transcription of the speech collection.
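For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how Mean Average Precision is commonly computed from ranked result lists. It shows the standard definition only; the function names and toy document ids are assumptions, and it does not reproduce the paper's evaluation setup.

```python
def average_precision(ranked, relevant):
    """Average of precision values taken at the rank of each relevant hit."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy queries (assumed data): each pairs a ranking with its relevant set.
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),
    (["d2", "d1"], {"d1"}),
]
map_score = mean_average_precision(runs)
```

In the toy data, the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so the mean is 2/3.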


#25 Exploiting passage retrieval for n-best rescoring of spoken questions

Authors: Tomoyosi Akiba, Hiroyuki Abe

Speech interfaces using an LVCSR system have promise for improving the utility of open-domain Question Answering, in which natural language questions about diverse topics are used as inputs. In this paper, we propose a method to improve both speech recognition and question answering performance by incorporating passage retrieval, which is a component common to many QA systems, with respect to the target documents that the input question asks about. In the QA process, a passage that has high similarity to the question has a high probability of containing the correct answer. Conversely, this similarity can be used to select the appropriate candidate from the N-best list of speech recognition results. From a language modeling perspective, this process can be seen as capturing the semantic consistency of the spoken question at the sentence level, in contrast to conventional n-gram language models. We show the effectiveness of our method by means of experiments.
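The rescoring idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' formulation: similarity is taken to be bag-of-words cosine similarity, and it is combined with the recognizer score by a simple weighted sum. The names `cosine`, `rescore`, and `weight` are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity between two whitespace-tokenized strings."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def rescore(nbest, passages, weight=1.0):
    """Pick the hypothesis maximizing ASR score + weight * best passage similarity.

    nbest is a list of (hypothesis_text, asr_log_score) pairs.
    """
    def combined(item):
        hyp, asr_score = item
        best_sim = max((cosine(hyp, p) for p in passages), default=0.0)
        return asr_score + weight * best_sim
    return max(nbest, key=combined)[0]

# Toy N-best list and passage collection (assumed data).
nbest = [("who rode hamlet", -1.0), ("who wrote hamlet", -1.1)]
passages = ["shakespeare wrote hamlet around 1600"]
best = rescore(nbest, passages)
```

With `weight=0.0` the recognizer's top hypothesis is returned unchanged; with the default weight, the hypothesis that better matches a retrieved passage overtakes it in this toy example.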