INTERSPEECH 2008

| Total: 762

#1 In search of models in speech communication research [PDF] [Copy] [Kimi] [REL]

Author: Hiroya Fujisaki

This paper first presents the author's personal view on the importance of modeling in scientific research in general, and then describes two of his works toward modeling certain aspects of human speech communication. The first work is concerned with the physiological and physical mechanisms of controlling the voice fundamental frequency of speech, which is an important parameter for expressing information on tone, accent, and intonation. The second work is concerned with the cognitive processes involved in a discrimination test of speech stimuli, which gives rise to the phenomenon of so-called categorical perception. They are meant to illustrate the power of models based on deep understanding and precise formulation of the functions of the mechanisms/processes that underlie observed phenomena. Finally, it also presents the author's view on some models that are yet to be developed.
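
The F0 control model referred to in the first work is widely known in the literature as the command-response (Fujisaki) model: log F0 is the superposition of a baseline, phrase components (impulse responses of a second-order linear system), and accent components (step responses). A minimal sketch follows; the parameter values are illustrative defaults, not values from the paper.

```python
import numpy as np

def phrase_component(t, alpha=3.0):
    """Impulse response of the phrase control mechanism, Gp(t)."""
    return np.where(t >= 0, alpha**2 * t * np.exp(-alpha * t), 0.0)

def accent_component(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, Ga(t), clipped at gamma."""
    g = 1.0 - (1.0 + beta * t) * np.exp(-beta * t)
    return np.where(t >= 0, np.minimum(g, gamma), 0.0)

def fujisaki_f0(t, fb, phrase_cmds, accent_cmds):
    """ln F0(t) = ln Fb + sum of phrase components + sum of accent components."""
    ln_f0 = np.full_like(t, np.log(fb))
    for t0, ap in phrase_cmds:              # (onset time, magnitude)
        ln_f0 += ap * phrase_component(t - t0)
    for t1, t2, aa in accent_cmds:          # (onset, offset, amplitude)
        ln_f0 += aa * (accent_component(t - t1) - accent_component(t - t2))
    return np.exp(ln_f0)

# Illustrative contour: one phrase command, one accent command
t = np.linspace(0.0, 2.0, 400)
f0 = fujisaki_f0(t, fb=100.0, phrase_cmds=[(0.0, 0.5)],
                 accent_cmds=[(0.3, 0.8, 0.4)])
```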

#2 Dealing with limited and noisy data in ASR: a hybrid knowledge-based and statistical approach [PDF] [Copy] [Kimi] [REL]

Author: Abeer Alwan

In this talk, I will focus on the importance of integrating knowledge of human speech production and speech perception mechanisms, and language-specific information, with statistically based, data-driven approaches to develop robust and scalable automatic speech recognition (ASR) systems. As we will demonstrate, the need for such hybrid systems is especially critical when the ASR system is dealing with noisy data, when adaptation data are limited (as in speaker normalization and adaptation), and when dealing with accents.

#3 Forensic automatic speaker recognition: fiction or science? [PDF] [Copy] [Kimi] [REL]

Author: Joaquin Gonzalez-Rodriguez

Hollywood films and CSI-like series show a technology landscape far from reality, both in forensic speaker recognition and in other identification-of-the-source forensic areas. Lay persons are used to good-looking scientist-investigators performing voice identifications ("we got a match!") or smart, fancy devices producing voice transformations that let one actor instantaneously talk with the voice of another. Simultaneously, forensic identification science is facing a global challenge, impelled firstly by progressively higher requirements for admissibility of expert testimony in court, and secondly by the transparent and testable nature of DNA typing, which is now seen as the new gold-standard model of a scientifically defensible approach to be emulated by all other identification-of-the-source areas. In this presentation we will show how forensic speaker recognition can comply with the requirements of transparency and testability in forensic science. This will lead to fulfilling the court requirements for role separation between scientists and judges/juries, and bring about integration in a forensically adequate framework in which the scientist provides the information necessary to the court's decision processes.

#4 Modelling rapport in embodied conversational agents [PDF] [Copy] [Kimi] [REL]

Author: Justine Cassell

In this talk I report on a series of studies that attempt to characterize the role of language and nonverbal behavior in relationship-building and rapport in humans, and then to use the results to implement embodied conversational agents capable of rapport with their users. In particular, we are implementing virtual survey interviewers that can use rapport to elicit truthful responses, and virtual direction-giving agents that behave differently as they give directions over the lifetime of use. We are implementing virtual peers that can engage in collaborative learning with children within different dialect communities, virtual peers that can scaffold the learning of rapport behaviors in children with autism spectrum disorder, and virtual peers that can be used to assess the social skills deficits of children with autism spectrum disorder so as to better plan their treatment. The goal of the research program is to better understand linguistic and nonverbal coordination devices from the utterance level to the relationship level: how they work in humans, how they can be modeled in virtual humans, and how virtual humans can be implemented to help humans have productive and satisfying relationships, with machines and with one another, over long periods of time.

#5 Agglomerative hierarchical speaker clustering using incremental Gaussian mixture cluster modeling [PDF] [Copy] [Kimi] [REL]

Authors: Kyu J. Han ; Shrikanth S. Narayanan

This paper proposes a novel cluster modeling method for inter-cluster distance measurement within the framework of agglomerative hierarchical speaker clustering, namely incremental Gaussian mixture cluster modeling. This method uses a single Gaussian distribution to model each initial cluster, but represents any newly merged cluster with a distribution whose pdf is the weighted sum of the pdfs of the respective model distributions for the clusters involved in the particular merge. As a result, clusters transition smoothly to being modeled by Gaussian mixtures whose components are incremented as merging recursions continue during clustering. The proposed method can overcome the limited cluster representation capability of conventional single-Gaussian cluster modeling. Through experiments on various sets of initial clusters, we demonstrate that our approach improves the reliability of speaker clustering performance.
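
The merging rule can be made concrete with a short sketch. Two assumptions beyond the abstract: merged mixture weights are proportional to cluster frame counts, and initial models use full covariances.

```python
import numpy as np
from scipy.stats import multivariate_normal

class GMMCluster:
    """Cluster model: starts as a single Gaussian, becomes a mixture on merge."""
    def __init__(self, weights, means, covs, n):
        self.weights, self.means, self.covs, self.n = weights, means, covs, n

    @classmethod
    def from_frames(cls, frames):
        # Initial cluster: one Gaussian fit to the frames
        return cls(np.array([1.0]), frames.mean(0, keepdims=True),
                   np.cov(frames.T)[None], len(frames))

def merge(a, b):
    """Merged pdf = (n_a*pdf_a + n_b*pdf_b)/(n_a+n_b): keep all components
    and rescale the mixture weights (size-proportional weighting assumed)."""
    n = a.n + b.n
    return GMMCluster(np.concatenate([a.weights * a.n, b.weights * b.n]) / n,
                      np.vstack([a.means, b.means]),
                      np.concatenate([a.covs, b.covs]), n)

def log_likelihood(c, frames):
    """Total log-likelihood of frames under the cluster's mixture model."""
    dens = sum(w * multivariate_normal.pdf(frames, mean=m, cov=cv)
               for w, m, cv in zip(c.weights, c.means, c.covs))
    return np.log(dens).sum()
```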

#6 Weighted segmental k-means initialization for SOM-based speaker clustering [PDF] [Copy] [Kimi] [REL]

Authors: Oshry Ben-Harush ; Itshak Lapidot ; Hugo Guterman

A new approach for the initial assignment of data in a speaker clustering application is presented. This approach employs a Weighted Segmental K-Means clustering algorithm prior to competitive learning. The clustering system relies on Self-Organizing Maps (SOM) for speaker modeling and likelihood estimation. Performance is evaluated on 108 two-speaker conversations taken from the LDC CALLHOME American English Speech corpus using the NIST criterion, and shows an improvement of approximately 48% in Cluster Error Rate (CER) relative to the randomly initialized clustering system. The number of iterations was also reduced significantly, which contributes to both the speed and the efficiency of the clustering system.
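
The abstract does not spell out the algorithm, but a plausible reading of segmental k-means is that whole segments, not individual frames, are the assignment unit, with each segment weighted by its frame count. A hedged sketch under those assumptions:

```python
import numpy as np

def segmental_kmeans(segments, k, iters=20, seed=None):
    """Sketch of weighted segmental k-means: each segment (a frames-by-dim
    array) is summarized by its mean vector and weighted by its frame count.
    The weighting scheme is our assumption, not taken from the paper."""
    rng = np.random.default_rng(seed)
    reps = np.array([s.mean(axis=0) for s in segments])    # one vector per segment
    w = np.array([len(s) for s in segments], dtype=float)  # weights: frame counts
    centroids = reps[rng.choice(len(reps), k, replace=False)]
    for _ in range(iters):
        d = ((reps[:, None, :] - centroids[None]) ** 2).sum(-1)
        labels = d.argmin(axis=1)                          # assign whole segments
        for j in range(k):
            mask = labels == j
            if mask.any():
                centroids[j] = np.average(reps[mask], axis=0, weights=w[mask])
    return labels, centroids
```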

#7 Learning essential speaker sub-space using hetero-associative neural networks for speaker clustering [PDF] [Copy] [Kimi] [REL]

Authors: Shajith Ikbal ; Karthik Visweswariah

In this paper, we present a novel approach to speaker clustering that uses a hetero-associative neural network (HANN) to compute very low dimensional speaker-discriminatory features (in our case 1-dimensional) in a data-driven manner. A HANN trained to map the input feature space onto speaker labels through a bottleneck hidden layer is expected to learn a very low dimensional feature subspace that essentially contains the speaker information. The low-dimensional features are then used in a simple k-means clustering algorithm to obtain speaker segmentation. Evaluation of this approach on a database of real-life conversational speech from call centers shows that the clustering performance achieved is similar to that of state-of-the-art systems, although our approach uses just 1-dimensional features. Augmenting these features with traditional mel-frequency cepstral coefficient (MFCC) features in the state-of-the-art system resulted in improved clustering performance.
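
A minimal sketch of the bottleneck idea, using PyTorch and scikit-learn as stand-ins; the layer sizes, optimizer, and training loop are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class HANN(nn.Module):
    """Hetero-associative net: features -> speaker labels through a
    1-unit bottleneck (hidden sizes are illustrative)."""
    def __init__(self, dim_in, n_speakers, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, hidden), nn.Tanh(),
                                     nn.Linear(hidden, 1))          # bottleneck
        self.decoder = nn.Sequential(nn.Tanh(), nn.Linear(1, hidden),
                                     nn.Tanh(), nn.Linear(hidden, n_speakers))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def train_and_cluster(feats, labels, n_speakers, epochs=50):
    """Train on labelled data, then k-means the 1-d bottleneck features."""
    model = HANN(feats.shape[1], n_speakers)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    x = torch.as_tensor(feats, dtype=torch.float32)
    y = torch.as_tensor(labels, dtype=torch.long)
    for _ in range(epochs):
        opt.zero_grad()
        logits, _ = model(x)
        loss_fn(logits, y).backward()
        opt.step()
    with torch.no_grad():
        _, z = model(x)                     # 1-d speaker features
    return KMeans(n_clusters=n_speakers, n_init=10).fit_predict(z.numpy())
```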

#8 Two's a crowd: improving speaker diarization by automatically identifying and excluding overlapped speech [PDF] [Copy] [Kimi] [REL]

Authors: Kofi Boakye ; Oriol Vinyals ; Gerald Friedland

We present an update to our initial work [1] on overlapped speech detection for improving speaker diarization. Specifically, we describe the addition of new features and feature warping techniques that improve segmenter and, consequently, diarization performance. We also demonstrate improved diarization performance by additionally using overlap segment information in a new diarization pre-processing step which excludes overlap segments from speaker clustering. On a subset of the AMI Meeting Corpus we show that this overlap exclusion step nearly triples the relative improvement of diarization error rate as compared to overlap segment post-processing alone.
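
The exclusion step itself is simple to sketch: given frame-level overlap flags from the segmenter, drop the flagged frames before clustering. This is a hypothetical helper, not the authors' code.

```python
import numpy as np

def exclude_overlap(frames, overlap_flags):
    """Pre-processing sketch: remove frames flagged as overlapped speech so
    that speaker clustering sees only single-speaker data. Returns the kept
    frames plus their original indices, so labels can be mapped back later."""
    keep = ~np.asarray(overlap_flags, dtype=bool)
    return frames[keep], np.flatnonzero(keep)
```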

#9 T-test distance and clustering criterion for speaker diarization [PDF] [Copy] [Kimi] [REL]

Authors: Trung Hieu Nguyen ; Eng Siong Chng ; Haizhou Li

In this paper, we present an application of Student's t-test to measuring the similarity between two speaker models. The measure is evaluated by comparison with other distance metrics: the Generalized Likelihood Ratio, the Cross Likelihood Ratio, and the Normalized Cross Likelihood Ratio in a speaker detection task. We also propose an objective criterion for speaker clustering. The criterion deduces the number of speakers automatically by maximizing the separation between intra-speaker and inter-speaker distances. It requires no development data and works well with various distance metrics. We then report the performance of our proposed similarity measure and objective criterion in a speaker diarization task. The system produces competitive results: a low speaker diarization error rate and high accuracy in detecting the number of speakers.
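
One plausible concrete form of such a distance: per-dimension Welch t statistics between the frames of two clusters, aggregated by mean absolute value. The aggregation rule is our assumption; the paper's exact statistic may differ.

```python
import numpy as np
from scipy.stats import ttest_ind

def t_test_distance(frames_a, frames_b):
    """Sketch: Welch's t-test per feature dimension between two clusters'
    frames (n_frames x dim arrays); larger mean |t| means less similar."""
    t, _ = ttest_ind(frames_a, frames_b, equal_var=False)
    return np.abs(t).mean()
```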

#10 Integration of TDOA features in information bottleneck framework for fast speaker diarization [PDF] [Copy] [Kimi] [REL]

Authors: Deepu Vijayasenan ; Fabio Valente ; Hervé Bourlard

In this paper we address the combination of multiple feature streams in a fast speaker diarization system for meeting recordings. Whenever Multiple Distant Microphones (MDM) are used, it is possible to estimate the Time Delay of Arrival (TDOA) for different channels. In [1], it is shown that TDOA can be used as additional features together with conventional spectral features for improving speaker diarization. We investigate here the combination of TDOA and spectral features in a fast diarization system based on the Information Bottleneck principle. We evaluate the algorithm on the NIST RT06 diarization task. Adding TDOA features to spectral features reduces the speaker error by 7% absolute. Results are comparable to those of conventional HMM/GMM based systems with consistent reduction in computational complexity.
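
TDOA features for MDM recordings are commonly estimated with GCC-PHAT; a standard sketch of that estimator follows (not necessarily the one used in this system).

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time delay of arrival (seconds) between two channels
    with the PHAT-weighted generalized cross-correlation."""
    n = len(sig) + len(ref)
    X = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cc = np.fft.irfft(X / (np.abs(X) + 1e-12), n=n)   # PHAT weighting
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```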

#11 Duration and F0 interval of utterance-final intonation contours in the perception of German sentence modality [PDF] [Copy] [Kimi] [REL]

Authors: Benno Peters ; Hartmut R. Pfitzinger

This paper investigates the influence of the duration and F0 interval of the utterance-final F0 contour on the perception of sentence modality, i.e. declarative vs. interrogative. An utterance-final rising contour with a constant F0 interval of 2 semitones or more and a voicing duration of at least 50 ms leads to unanimously identified interrogative modality. Even at durations of 20 and 30 ms, a significant number of listeners are able to consistently identify sentence modality. The F0 interval seems to predict perceived sentence modality better than the F0 slope.
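
For reference, the semitone interval of a rise is just a log-frequency ratio, so the reported 2-semitone threshold corresponds to a frequency ratio of about 1.12:

```python
import numpy as np

def f0_interval_semitones(f0_start, f0_end):
    """F0 interval of a final rise in semitones: 12 * log2(f_end / f_start)."""
    return 12.0 * np.log2(f0_end / f0_start)

# e.g. a rise from 110 Hz to 123.5 Hz is ~2 semitones, the reported threshold
print(f0_interval_semitones(110.0, 123.5))
```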

#12 Contrastive utterances make alternatives salient - cross-modal priming evidence [PDF] [Copy] [Kimi] [REL]

Authors: Bettina Braun ; Lara Tagliapietra ; Anne Cutler

Sentences with contrastive intonation are assumed to presuppose contextual alternatives to the accented elements. Two cross-modal priming experiments tested in Dutch whether such contextual alternatives are automatically available to listeners. Contrastive associates - but not non-contrastive associates - were facilitated only when primes were produced in sentences with contrastive intonation, indicating that contrastive intonation makes unmentioned contextual alternatives immediately available. Possibly, contrastive contours trigger a "presupposition resolution mechanism" by which these alternatives become salient.

#13 Exploring a mechanism of speech synchronization using auditory delayed experiments [PDF] [Copy] [Kimi] [REL]

Authors: Masato Ishizaki ; Yasuharu Den ; Senshi Fukashiro

This paper investigated how speakers synchronize their speech, through experiments in which participants recited naturally and simultaneously under delayed auditory feedback conditions. Statistical analysis revealed that the speakers changed strategies to adjust the timing of their utterances. This finding points to a fundamental mechanism for coordinating verbal behavior that can contribute to designing comfortable interactions with virtual agents or robots.

#14 Prosodic manifestations of confidence and uncertainty in spoken language [PDF] [Copy] [Kimi] [REL]

Author: Heather Pon-Barry

We present a project aimed at understanding the acoustic and prosodic correlates of confidence and uncertainty in spoken language. We elicited speech produced under varying levels of certainty and performed perceptual and statistical analyses on the speech data to determine which prosodic features (e.g., pitch, energy, timing) are associated with a speaker's level of certainty, and where these prosodic manifestations occur relative to the location of the word or phrase that the speaker is confident or uncertain about. Our findings suggest that prosodic manifestations of confidence and uncertainty occur both in the local region that causes the uncertainty and in its surrounding context.

#15 Identifying relevant phrases to summarize decisions in spoken meetings [PDF] [Copy] [Kimi] [REL]

Authors: Raquel Fernandez ; Matthew Frampton ; John Dowding ; Anish Adukuzhiyil ; Patrick Ehlen ; Stanley Peters

We address the problem of identifying words and phrases that accurately capture, or contribute to, the semantic gist of decisions made in multi-party human-human meetings. We first describe our approach to modelling decision discussions in spoken meetings and then compare two approaches to extracting information from these discussions. The first one uses an open-domain semantic parser that identifies candidate phrases for decision summaries and then employs machine learning techniques to select from those candidate phrases. The second one uses categorical and sequential classifiers that exploit simple syntactic and semantic features to identify words and phrases relevant for decision summarization.

#16 Recovering participant identities in meetings from a probabilistic description of vocal interaction [PDF] [Copy] [Kimi] [REL]

Authors: Kornel Laskowski ; Tanja Schultz

An important decision in the design of automatic conversation understanding systems is the level at which information streams representing specific participants are merged. In the current work, we explore participant-dependence of low-level interactive aspects of conversation, namely the observed contextual preferences for talkspurt deployment. We argue that strong participant-dependence at this level gives cause for merging participant streams as early as possible. We demonstrate that our probabilistic description of talkspurt deployment preferences is strongly participant-dependent, and frequently predictive of participant identity.

#17 Multidimensional features of emotional speech [PDF] [Copy] [Kimi] [REL]

Authors: Tomoko Suzuki ; Machiko Ikemoto ; Tomoko Sano ; Toshihiko Kinoshita

The purpose of this study is to investigate the features of emotional speech by means of a multidimensional scaling (MDS) procedure based on the visually perceived similarity of vocal parameters. We extracted three vocal parameters (pitch, intensity, and spectrogram) from speech expressing emotions. Three researchers grouped cards showing the parameters according to their visual similarity. The MDS result for the spectrograms revealed two dimensions, pleasantness (positive-negative) and activation (high activation - low activation), which are similar in structure to the auditory perception of vocal emotion. Finally, we conclude that spectrogram features are related to pleasantness.
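
A sketch of the MDS step with scikit-learn, assuming the raters' groupings are first converted into a precomputed dissimilarity matrix (that conversion step is not described here):

```python
from sklearn.manifold import MDS

def embed_dissimilarities(d, seed=0):
    """2-D metric MDS embedding of a precomputed dissimilarity matrix d,
    e.g. one derived from how often raters grouped two cards together."""
    mds = MDS(n_components=2, dissimilarity='precomputed', random_state=seed)
    return mds.fit_transform(d)
```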

#18 Leveraging emotion detection using emotions from yes-no answers [PDF] [Copy] [Kimi] [REL]

Authors: Narjes Boufaden ; Pierre Dumouchel

We present a new approach for the detection of negative versus non-negative emotions in human-computer dialogs in the specific domain of call centers. We argue that it is possible to improve emotion detection without using additional linguistic or contextual information. We show that no-answers are emotionally salient words and that it is possible to improve the classification accuracy of human-computer dialogs by taking advantage of the high accuracy achieved on no-answer turns. We also show that stacked generalization using neural networks and SVMs as base models improves the accuracy of each model, while the combination of the no-model and the dialog model improves the accuracy of the dialog model alone by 13%.
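
Stacked generalization with neural-network and SVM base models can be sketched with scikit-learn stand-ins; the model choices and hyperparameters below are illustrative, not the paper's setup.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def build_stack():
    """Stacking sketch: an MLP and an SVM as base learners, with a
    logistic-regression meta-learner trained on their predictions."""
    base = [('mlp', MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)),
            ('svm', SVC(probability=True))]
    return StackingClassifier(estimators=base,
                              final_estimator=LogisticRegression())

# usage: build_stack().fit(X_train, y_train).predict(X_test)
```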

#19 Vowel placement during operatic singing: 'come si parla' or 'aggiustamento'? [PDF] [Copy] [Kimi] [REL]

Authors: Thomas J. Millhouse ; Dianna T. Kenny

This study explored two tenets of the Italian Bel Canto operatic singing technique: "come si parla" and "aggiustamento". Articulatory changes in the lower formant vowel space of 11 spoken and sung vowels were systematically examined in six male singers. Results showed that singers influence the placement of the lowest formant frequencies in the sung vowel space using both a lowered larynx and modified vowel articulation (aggiustamento) with rising pitch, especially above 220 Hz.

#20 Study on strained rough voice as a conveyer of rage [PDF] [Copy] [Kimi] [REL]

Authors: Yumiko O. Kato ; Yoshifumi Hirose ; Takahiro Kamai

It is important to be able to detect anger and its degree for dialog management in an interactive speech interface. We investigated the characteristics of a strained rough voice as a conveyer of a speaker's degree of anger. In hot-anger speech in Japanese, a rough voice with high glottal tension is observed frequently, and the rate of occurrence of the strained rough voice increases with the degree of anger. In a typical male speaker's speech sample, the amplitude fluctuations observed in a strained rough voice were periodic, with a frequency between 40 and 80 Hz. The modulation ratio in rage speech was larger than in other emotional states, suggesting the possibility of determining a speaker's anger and its degree by detecting strained rough voice.
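
One way to quantify such a modulation ratio is the share of envelope-spectrum energy in the 40-80 Hz band; the paper's exact measure may differ, so treat this as an assumption-laden sketch.

```python
import numpy as np
from scipy.signal import hilbert

def modulation_ratio(x, fs, band=(40.0, 80.0)):
    """Strength of 40-80 Hz amplitude fluctuation: fraction of the
    envelope spectrum's energy that falls inside the band."""
    env = np.abs(hilbert(x))                  # amplitude envelope
    env = env - env.mean()                    # remove DC before the FFT
    spec = np.abs(np.fft.rfft(env)) ** 2
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return spec[in_band].sum() / (spec.sum() + 1e-12)
```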

#21 Integrating rule and template-based approaches for emotional Malay speech synthesis [PDF] [Copy] [Kimi] [REL]

Authors: Mumtaz Begum ; Raja N. Ainon ; Roziati Zainuddin ; Zuraidah M. Don ; Gerry Knowles

The manipulation of prosody, including pitch, duration, and intensity, is one of the leading approaches to synthesizing emotion. This paper reports work on the development of a Malay emotional speech synthesizer capable of expressing four basic emotions, namely happiness, anger, sadness, and fear, for any text input with various intonation patterns, using the prosody manipulation principle. The synthesizer makes use of prosody templates and parametric prosody manipulation for different types of sentence structure.
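
The template idea can be illustrated with global scale factors per emotion applied to neutral prosody targets; all numbers below are invented for the sketch, not the paper's templates.

```python
import numpy as np

# Illustrative emotion templates: scale factors for F0, duration, intensity.
TEMPLATES = {
    'happiness': dict(f0=1.20, dur=0.90, gain=1.1),
    'anger':     dict(f0=1.30, dur=0.85, gain=1.3),
    'sadness':   dict(f0=0.85, dur=1.20, gain=0.8),
    'fear':      dict(f0=1.15, dur=1.05, gain=0.9),
}

def apply_template(f0_contour, durations, gains, emotion):
    """Scale neutral F0, duration, and intensity targets by an emotion template."""
    t = TEMPLATES[emotion]
    return (np.asarray(f0_contour) * t['f0'],
            np.asarray(durations) * t['dur'],
            np.asarray(gains) * t['gain'])
```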

#22 The expression and perception of emotions: comparing assessments of self versus others [PDF] [Copy] [Kimi] [REL]

Authors: Carlos Busso ; Shrikanth S. Narayanan

In the study of expressive speech communication, it is commonly accepted that the emotion perceived by the listener is a good approximation of the intended emotion conveyed by the speaker. This paper analyzes the validity of this assumption by comparing the mismatches between the assessments made by naive listeners and by the speakers that generated the data. The analysis is based on the hypothesis that people are better decoders of their own emotions. Therefore, self-assessments will be closer to the intended emotions. Using the IEMOCAP database, discrete (categorical) and continuous (attribute) emotional assessments evaluated by the actors and naive listeners are compared. The results indicate that there is a mismatch between the expression and perception of emotion. The speakers in the database assigned their own emotions to more specific emotional categories, which led to more extreme values in the activation-valence space.

#23 On the role of acting skills for the collection of simulated emotional speech [PDF] [Copy] [Kimi] [REL]

Authors: Emiel Krahmer ; Marc Swerts

We experimentally compared non-simulated with simulated expressions of emotion produced both by inexperienced and by experienced actors. Contrary to our expectations, in a perception experiment participants rated the expressions of experienced actors as more extreme and less like non-simulated ("real") expressions than those produced by non-professional actors.

#24 Detection of security related affect and behaviour in passenger transport [PDF] [Copy] [Kimi] [REL]

Authors: Björn Schuller ; Matthias Wimmer ; Dejan Arsic ; Tobias Moosmayr ; Gerhard Rigoll

Surveillance of drivers, pilots, or passengers holds significant potential for increased security in passenger transport. In an automotive setting, the interaction can, for example, be improved by the social awareness of an MMI. As a further example, security marshals can be positioned efficiently, guided by such systems. Within this scope, the detection of security-relevant behavior patterns such as aggressiveness or stress is discussed. The focus lies on real-life usage, respecting online processing, subject independence, and noise robustness. The approach introduced employs multivariate time-series analysis for the synchronization and data reduction of audio and video by brute-force feature generation. Accuracy is boosted by combined optimization of the large audiovisual feature space. Extensive results are reported on aviation behavior, and, in particular for the audio channel, on numerous standard corpora. The influence of noise is discussed using representative car-noise overlay.

#25 Emotions and articulatory precision [PDF] [Copy] [Kimi] [REL]

Authors: Martijn Goudbeek ; Jean Philippe Goldman ; Klaus R. Scherer

The influence of emotion on articulatory precision was investigated in a newly established corpus of acted emotional utterances. The area of the vocalic triangle between the vowels /i/, /u/, and /a/ was measured and shown to be significantly affected by emotion. Furthermore, this area correlated significantly with the potency dimension of a large scale study of emotion words, reflecting the predictions of the component process model of emotion.