INTERSPEECH 2008 - Others

Total: 264

#1 Agglomerative hierarchical speaker clustering using incremental Gaussian mixture cluster modeling [PDF] [Copy] [Kimi]

Authors: Kyu J. Han ; Shrikanth S. Narayanan

This paper proposes a novel cluster modeling method for intercluster distance measurement within the framework of agglomerative hierarchical speaker clustering, namely, incremental Gaussian mixture cluster modeling. This method uses a single Gaussian distribution to model each initial cluster, but represents any newly merged cluster using a distribution whose pdf is the weighted sum of the pdf's of the respective model distributions for the clusters involved in the particular merging process. As a result, clusters are smoothly transitioned to be modeled by Gaussian mixtures whose components are incremented as merging recursions continue during clustering. The proposed method can overcome the limited cluster representation capability of conventional single Gaussian cluster modeling. Through experiments on various sets of initial clusters, it is demonstrated that our approach consequently improves the reliability of speaker clustering performance.
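
For readers who want a concrete picture of the merging scheme described above, the following is a minimal sketch, assuming clusters are stored as lists of weighted Gaussian components and that mixture weights are rescaled by the parents' relative frame counts; the class, the function names, and the weighting choice are illustrative assumptions, not the authors' implementation.

```python
"""Minimal sketch of incremental Gaussian mixture cluster modeling.
All names and the weighting scheme are illustrative assumptions."""
import numpy as np
from scipy.stats import multivariate_normal


class Cluster:
    def __init__(self, frames=None, components=None, n=0):
        if frames is not None:
            # Each initial cluster is modeled by a single Gaussian.
            self.n = len(frames)
            self.components = [(1.0, frames.mean(axis=0),
                                np.cov(frames, rowvar=False))]
        else:
            self.n, self.components = n, components

    def logpdf(self, frames):
        # Mixture pdf: the weighted sum of the component pdfs.
        dens = sum(w * multivariate_normal(m, c, allow_singular=True).pdf(frames)
                   for w, m, c in self.components)
        return float(np.log(dens + 1e-300).sum())


def merge(a, b):
    """The merged cluster keeps every component of both parents; mixture
    weights are rescaled by the parents' relative frame counts."""
    n = a.n + b.n
    wa, wb = a.n / n, b.n / n
    comps = ([(wa * w, m, c) for w, m, c in a.components] +
             [(wb * w, m, c) for w, m, c in b.components])
    return Cluster(components=comps, n=n)
```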

#2 Weighted segmental k-means initialization for SOM-based speaker clustering [PDF] [Copy] [Kimi]

Authors: Oshry Ben-Harush ; Itshak Lapidot ; Hugo Guterman

A new approach for the initial assignment of data in a speaker clustering application is presented. This approach employs a weighted segmental k-means clustering algorithm prior to competitive-based learning. The clustering system relies on Self-Organizing Maps (SOM) for speaker modeling and likelihood estimation. Performance is evaluated on 108 two-speaker conversations taken from the LDC CALLHOME American English Speech corpus using the NIST criterion, and shows an improvement of approximately 48% in Cluster Error Rate (CER) relative to the randomly initialized clustering system. The number of iterations was also reduced significantly, which contributes to both the speed and efficiency of the clustering system.
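
A rough sketch of how such a weighted segmental k-means initialization might look is given below, assuming segment length is used as the weight and scikit-learn's KMeans as the clustering backend; both are illustrative assumptions, and the returned labels would seed the SOM training instead of a random start.

```python
"""Sketch of a weighted segmental k-means initialization for speaker
clustering; segment length as sample weight is an assumption."""
import numpy as np
from sklearn.cluster import KMeans


def segmental_kmeans_init(features, segment_bounds, n_speakers=2):
    # One mean vector per segment; longer segments carry more weight.
    seg_means, seg_weights = [], []
    for start, end in segment_bounds:
        seg_means.append(features[start:end].mean(axis=0))
        seg_weights.append(end - start)
    km = KMeans(n_clusters=n_speakers, n_init=10, random_state=0)
    labels = km.fit_predict(np.array(seg_means),
                            sample_weight=np.array(seg_weights, dtype=float))
    # These labels can initialize the SOM-based clustering system.
    return labels
```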

#3 Learning essential speaker sub-space using hetero-associative neural networks for speaker clustering [PDF] [Copy] [Kimi]

Authors: Shajith Ikbal ; Karthik Visweswariah

In this paper, we present a novel approach to speaker clustering involving the use of a hetero-associative neural network (HANN) to compute very low dimensional speaker-discriminatory features (in our case 1-dimensional) in a data-driven manner. A HANN trained to map the input feature space onto speaker labels through a bottleneck hidden layer is expected to learn a very low dimensional feature subspace essentially containing speaker information. The lower dimensional features are then used in a simple k-means clustering algorithm to obtain speaker segmentation. Evaluation of this approach on a database of real-life conversational speech from call centers shows that the clustering performance achieved is similar to that of state-of-the-art systems, although our approach uses just 1-dimensional features. Augmenting these features with traditional mel-frequency cepstral coefficient (MFCC) features in the state-of-the-art system resulted in improved clustering performance.
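
A small sketch of the bottleneck idea is shown below, using PyTorch; layer sizes, the optimizer, and the training loop are illustrative assumptions, and in practice the network would be trained on labelled data and then applied to new conversations before k-means.

```python
"""Sketch of a bottleneck network in the spirit of a HANN: map frame
features to speaker labels through a 1-D hidden layer, then cluster the
bottleneck features with k-means. Architecture details are assumptions."""
import torch
import torch.nn as nn
from sklearn.cluster import KMeans


class BottleneckNet(nn.Module):
    def __init__(self, n_in, n_speakers, bottleneck=1):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, 64), nn.Tanh(),
                                     nn.Linear(64, bottleneck))
        self.decoder = nn.Sequential(nn.Tanh(), nn.Linear(bottleneck, n_speakers))

    def forward(self, x):
        z = self.encoder(x)          # very low dimensional speaker subspace
        return self.decoder(z), z


def train_and_cluster(frames, speaker_labels, n_clusters):
    # Train on labelled frames (here, for brevity, the same data is clustered).
    model = BottleneckNet(frames.shape[1], int(speaker_labels.max()) + 1)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.as_tensor(frames, dtype=torch.float32)
    y = torch.as_tensor(speaker_labels, dtype=torch.long)
    for _ in range(200):
        opt.zero_grad()
        logits, _ = model(x)
        nn.functional.cross_entropy(logits, y).backward()
        opt.step()
    with torch.no_grad():            # extract the 1-D bottleneck features
        _, z = model(x)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(z.numpy())
```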

#4 Two's a crowd: improving speaker diarization by automatically identifying and excluding overlapped speech [PDF] [Copy] [Kimi]

Authors: Kofi Boakye ; Oriol Vinyals ; Gerald Friedland

We present an update to our initial work [1] on overlapped speech detection for improving speaker diarization. Specifically, we describe the addition of new features and feature warping techniques that improve segmenter and, consequently, diarization performance. We also demonstrate improved diarization performance by additionally using overlap segment information in a new diarization pre-processing step which excludes overlap segments from speaker clustering. On a subset of the AMI Meeting Corpus we show that this overlap exclusion step nearly triples the relative improvement of diarization error rate as compared to overlap segment post-processing alone.

#5 T-test distance and clustering criterion for speaker diarization [PDF] [Copy] [Kimi]

Authors: Trung Hieu Nguyen ; Eng Siong Chng ; Haizhou Li

In this paper, we present an application of Student's t-test to measure the similarity between two speaker models. The measure is evaluated by comparison with other distance metrics: the Generalized Likelihood Ratio, the Cross Likelihood Ratio and the Normalized Cross Likelihood Ratio in a speaker detection task. We also propose an objective criterion for speaker clustering. The criterion deduces the number of speakers automatically by maximizing the separation between intra-speaker distances and inter-speaker distances. It requires no development data and works well with various distance metrics. We then report the performance of our proposed similarity measure and objective criterion in a speaker diarization task. The system produces competitive results: a low speaker diarization error rate and high accuracy in detecting the number of speakers.
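
As a simplified stand-in for a t-test based distance, the sketch below computes per-dimension Welch t-statistics between the frame features of two segments and averages their magnitudes, together with a toy version of the intra/inter separation criterion; this is not the authors' exact formulation, only an illustration of the idea.

```python
"""Illustrative stand-in for a t-test distance and a separation-based
clustering criterion; not the paper's exact formulation."""
import numpy as np
from scipy import stats


def ttest_distance(frames_a, frames_b):
    # Per-dimension Welch t-statistics; larger |t| means the two segments
    # are less likely to come from the same speaker.
    t, _ = stats.ttest_ind(frames_a, frames_b, equal_var=False, axis=0)
    return float(np.mean(np.abs(t)))


def separation(intra_distances, inter_distances):
    # Criterion (illustrative): pick the number of speakers that maximizes
    # the gap between inter-speaker and intra-speaker distances.
    return float(np.mean(inter_distances) - np.mean(intra_distances))
```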

#6 Integration of TDOA features in information bottleneck framework for fast speaker diarization [PDF] [Copy] [Kimi]

Authors: Deepu Vijayasenan ; Fabio Valente ; Hervé Bourlard

In this paper we address the combination of multiple feature streams in a fast speaker diarization system for meeting recordings. Whenever Multiple Distant Microphones (MDM) are used, it is possible to estimate the Time Delay of Arrival (TDOA) for different channels. In [1], it is shown that TDOA can be used as additional features together with conventional spectral features for improving speaker diarization. We investigate here the combination of TDOA and spectral features in a fast diarization system based on the Information Bottleneck principle. We evaluate the algorithm on the NIST RT06 diarization task. Adding TDOA features to spectral features reduces the speaker error by 7% absolute. Results are comparable to those of conventional HMM/GMM based systems with consistent reduction in computational complexity.
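
One common way to combine two feature streams in IB-based diarization is a weighted average of the posterior distributions over the relevance variables; the short sketch below illustrates this, with the stream weights and array shapes chosen purely for illustration.

```python
"""Sketch of combining MFCC and TDOA streams by weighting their
posteriors over the relevance variables; weights are illustrative."""
import numpy as np


def combine_streams(p_mfcc, p_tdoa, w_mfcc=0.9, w_tdoa=0.1):
    # p_*: (n_segments, n_components) posteriors from each feature stream.
    combined = w_mfcc * p_mfcc + w_tdoa * p_tdoa
    return combined / combined.sum(axis=1, keepdims=True)
```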

#7 Duration and F0 interval of utterance-final intonation contours in the perception of German sentence modality [PDF] [Copy] [Kimi]

Authors: Benno Peters ; Hartmut R. Pfitzinger

This paper investigates the influence of the duration and F0 interval of the utterance-final F0 contour on the perception of sentence modality, i.e. declarative versus interrogative sentences. An utterance-final rising contour with a constant F0 interval of 2 semitones or more and a voicing duration of at least 50 ms leads to unanimously identified interrogative modality. Even at durations of 20 and 30 ms, a significant number of listeners are able to consistently identify sentence modality. The F0 interval seems to predict perceived sentence modality better than the F0 slope.

#8 Contrastive utterances make alternatives salient - cross-modal priming evidence [PDF] [Copy] [Kimi]

Authors: Bettina Braun ; Lara Tagliapietra ; Anne Cutler

Sentences with contrastive intonation are assumed to presuppose contextual alternatives to the accented elements. Two cross-modal priming experiments tested in Dutch whether such contextual alternatives are automatically available to listeners. Contrastive associates - but not non-contrastive associates - were facilitated only when primes were produced in sentences with contrastive intonation, indicating that contrastive intonation makes unmentioned contextual alternatives immediately available. Possibly, contrastive contours trigger a "presupposition resolution mechanism" by which these alternatives become salient.

#9 Exploring a mechanism of speech synchronization using auditory delayed experiments [PDF] [Copy] [Kimi]

Authors: Masato Ishizaki ; Yasuharu Den ; Senshi Fukashiro

This paper investigates how speakers synchronize their speech, using experiments in which participants recited naturally and simultaneously under auditorily delayed conditions. Statistical analysis revealed that the speakers changed strategies to adjust the timing of their utterances. This finding constitutes one fundamental mechanism for coordinating verbal behavior and can contribute to designing comfortable interactions with virtual agents or robots.

#10 Prosodic manifestations of confidence and uncertainty in spoken language [PDF] [Copy] [Kimi]

Author: Heather Pon-Barry

We present a project aimed at understanding the acoustic and prosodic correlates of confidence and uncertainty in spoken language. We elicited speech produced under varying levels of certainty and performed perceptual and statistical analyses on the speech data to determine which prosodic features (e.g., pitch, energy, timing) are associated with a speaker's level of certainty, and where these prosodic manifestations occur relative to the location of the word or phrase that the speaker is confident or uncertain about. Our findings suggest that prosodic manifestations of confidence and uncertainty occur both in the local region that causes the uncertainty and in its surrounding context.

#11 Identifying relevant phrases to summarize decisions in spoken meetings [PDF] [Copy] [Kimi]

Authors: Raquel Fernandez ; Matthew Frampton ; John Dowding ; Anish Adukuzhiyil ; Patrick Ehlen ; Stanley Peters

We address the problem of identifying words and phrases that accurately capture, or contribute to, the semantic gist of decisions made in multi-party human-human meetings. We first describe our approach to modelling decision discussions in spoken meetings and then compare two approaches to extracting information from these discussions. The first one uses an open-domain semantic parser that identifies candidate phrases for decision summaries and then employs machine learning techniques to select from those candidate phrases. The second one uses categorical and sequential classifiers that exploit simple syntactic and semantic features to identify words and phrases relevant for decision summarization.

#12 Recovering participant identities in meetings from a probabilistic description of vocal interaction [PDF] [Copy] [Kimi]

Authors: Kornel Laskowski ; Tanja Schultz

An important decision in the design of automatic conversation understanding systems is the level at which information streams representing specific participants are merged. In the current work, we explore participant-dependence of low-level interactive aspects of conversation, namely the observed contextual preferences for talkspurt deployment. We argue that strong participant-dependence at this level gives cause for merging participant streams as early as possible. We demonstrate that our probabilistic description of talkspurt deployment preferences is strongly participant-dependent, and frequently predictive of participant identity.

#13 Multidimensional features of emotional speech [PDF] [Copy] [Kimi]

Authors: Tomoko Suzuki ; Machiko Ikemoto ; Tomoko Sano ; Toshihiko Kinoshita

The purpose of this study is to investigate the features of emotional speech by means of a multidimensional scaling (MDS) procedure based on the visually perceived similarity of vocal parameters. We extracted three vocal parameters (pitch, intensity and spectrogram) from speech expressing emotions. Three researchers grouped cards showing these parameters according to their visual similarity. The MDS result for the spectrogram revealed two dimensions, pleasantness (positive - negative) and activation (high activation - low activation), which are similar in structure to the auditory perception of vocal emotions. Finally, we conclude that features of the spectrogram are related to pleasantness.
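
As an illustration of the MDS step, the sketch below recovers a two-dimensional space from a matrix recording how often two items were grouped together in the card-sorting task; deriving dissimilarities from co-occurrence counts is an assumption made for this example.

```python
"""Sketch of recovering a 2-D emotion space with MDS from card-sorting
co-occurrence counts; the dissimilarity derivation is an assumption."""
import numpy as np
from sklearn.manifold import MDS


def mds_from_grouping(co_occurrence):
    # Items grouped together often are treated as similar.
    dissimilarity = 1.0 - co_occurrence / co_occurrence.max()
    np.fill_diagonal(dissimilarity, 0.0)
    return MDS(n_components=2, dissimilarity="precomputed",
               random_state=0).fit_transform(dissimilarity)
```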

#14 Leveraging emotion detection using emotions from yes-no answers [PDF] [Copy] [Kimi]

Authors: Narjes Boufaden ; Pierre Dumouchel

We present a new approach for the detection of negative versus non-negative emotions in human-computer dialogs in the specific domain of call centers. We argue that it is possible to improve emotion detection without using additional linguistic or contextual information. We show that no-answers are emotionally salient words and that it is possible to improve the accuracy of the classification of human-computer dialogs by taking advantage of the high accuracy achieved on no-answer turns. We also show that stacked generalization using neural networks and SVMs as base models improves the accuracy of each individual model, while the combination of the no-answer model and the dialog model improves the accuracy of the dialog model alone by 13%.
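
A minimal sketch of stacked generalization with a neural network and an SVM as base models, using scikit-learn, is given below; the base-model settings and the logistic-regression meta-learner are illustrative assumptions.

```python
"""Sketch of stacked generalization with an MLP and an SVM as base
models; hyperparameters and the meta-learner are assumptions."""
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

stack = StackingClassifier(
    estimators=[("mlp", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)),
                ("svm", SVC(probability=True))],
    final_estimator=LogisticRegression(),   # meta-model combines base predictions
    stack_method="predict_proba")
# Usage: stack.fit(X_train, y_train); predictions = stack.predict(X_test)
```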

#15 Vowel placement during operatic singing: 'come si parla' or 'aggiustamento'? [PDF] [Copy] [Kimi]

Authors: Thomas J. Millhouse ; Dianna T. Kenny

This study explored two tenets of the Italian Bel Canto operatic singing technique: "come si parla" and "aggiustamento." Articulatory changes in the lower-formant vowel space of 11 spoken and sung vowels were systematically examined in six male singers. Results showed that singers influence the placement of the lowest formant frequencies in the sung vowel space using both a lowered larynx and modified vowel articulation (aggiustamento) with rising pitch, especially above 220 Hz.

#16 Study on strained rough voice as a conveyer of rage [PDF] [Copy] [Kimi]

Authors: Yumiko O. Kato ; Yoshifumi Hirose ; Takahiro Kamai

It is important to be able to determine anger and its degree for dialog management in an interactive speech interface. We investigated the characteristics of a strained rough voice as a conveyer of a speaker's degree of anger. In hot-anger speech in Japanese, a rough voice with high glottal tension is observed frequently, and the rate of occurrence of the strained rough voice increases with the degree of anger. In a typical male speaker's speech sample, the amplitude fluctuations observed in a strained rough voice were periodic, with a frequency roughly between 40 and 80 Hz. The modulation ratio in rage speech was larger than in other emotional states, suggesting the possibility of determining a speaker's anger and its degree by detecting the strained rough voice.
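
A rough sketch of measuring amplitude-modulation depth in the 40-80 Hz range from the speech envelope is shown below; the paper's exact "modulation ratio" is not reproduced, and the filter design and normalization are illustrative assumptions.

```python
"""Sketch of amplitude-modulation depth in the 40-80 Hz range; the
filtering and normalization choices are illustrative assumptions."""
import numpy as np
from scipy.signal import hilbert, butter, filtfilt


def modulation_depth(x, sr, lo=40.0, hi=80.0):
    envelope = np.abs(hilbert(x))                       # amplitude envelope
    b, a = butter(4, [lo / (sr / 2), hi / (sr / 2)], btype="band")
    fluctuation = filtfilt(b, a, envelope)              # 40-80 Hz fluctuation
    return float(np.std(fluctuation) / (np.mean(envelope) + 1e-12))
```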

#17 Integrating rule and template-based approaches for emotional Malay speech synthesis [PDF] [Copy] [Kimi]

Authors: Mumtaz Begum ; Raja N. Ainon ; Roziati Zainuddin ; Zuraidah M. Don ; Gerry Knowles

The manipulation of prosody, including pitch, duration and intensity, is one of the leading approaches to synthesizing emotion. This paper reports work on the development of a Malay emotional speech synthesizer capable of expressing four basic emotions, namely happiness, anger, sadness and fear, for any form of text input and with various intonation patterns, using the prosody manipulation principle. The synthesizer makes use of prosody templates and parametric prosody manipulation for different types of sentence structure.

#18 The expression and perception of emotions: comparing assessments of self versus others [PDF] [Copy] [Kimi]

Authors: Carlos Busso ; Shrikanth S. Narayanan

In the study of expressive speech communication, it is commonly accepted that the emotion perceived by the listener is a good approximation of the intended emotion conveyed by the speaker. This paper analyzes the validity of this assumption by comparing the mismatches between the assessments made by naive listeners and by the speakers that generated the data. The analysis is based on the hypothesis that people are better decoders of their own emotions. Therefore, self-assessments will be closer to the intended emotions. Using the IEMOCAP database, discrete (categorical) and continuous (attribute) emotional assessments evaluated by the actors and naive listeners are compared. The results indicate that there is a mismatch between the expression and perception of emotion. The speakers in the database assigned their own emotions to more specific emotional categories, which led to more extreme values in the activation-valence space.

#19 On the role of acting skills for the collection of simulated emotional speech [PDF] [Copy] [Kimi]

Authors: Emiel Krahmer ; Marc Swerts

We experimentally compared non-simulated with simulated expressions of emotion produced both by inexperienced and by experienced actors. Contrary to our expectations, in a perception experiment participants rated the expressions of experienced actors as more extreme and less like non-simulated ("real") expressions than those produced by non-professional actors.

#20 Detection of security related affect and behaviour in passenger transport [PDF] [Copy] [Kimi]

Authors: Björn Schuller ; Matthias Wimmer ; Dejan Arsic ; Tobias Moosmayr ; Gerhard Rigoll

Surveillance of drivers, pilots or passengers possesses significant potential for increased security within passenger transport. In an automotive setting, for example, interaction can be improved by the social awareness of an MMI; as a further example, security marshals can be positioned efficiently, guided by such systems. Within this scope, the detection of security-relevant behavior patterns such as aggressiveness or stress is discussed. The focus lies on real-life usage with respect to online processing, subject independence, and noise robustness. The approach introduced employs multivariate time-series analysis for the synchronization and data reduction of audio and video by brute-force feature generation. Accuracy is boosted by combined optimization of the large audiovisual feature space. Extensive results are reported on aviation behavior as well as, for the audio channel in particular, on numerous standard corpora. The influence of noise is discussed using representative car-noise overlay.

#21 Emotions and articulatory precision [PDF] [Copy] [Kimi]

Authors: Martijn Goudbeek ; Jean Philippe Goldman ; Klaus R. Scherer

The influence of emotion on articulatory precision was investigated in a newly established corpus of acted emotional utterances. The area of the vocalic triangle between the vowels /i/, /u/, and /a/ was measured and shown to be significantly affected by emotion. Furthermore, this area correlated significantly with the potency dimension of a large-scale study of emotion words, reflecting the predictions of the component process model of emotion.
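
The vocalic triangle area can be computed from per-speaker mean formant values with the shoelace formula, as in the short sketch below; the use of F1-F2 coordinates in Hz is an assumption made for illustration.

```python
"""Sketch of the vocalic triangle area between /i/, /u/ and /a/ in the
F1-F2 plane, using the shoelace formula."""
def vowel_triangle_area(f_i, f_u, f_a):
    # Each argument is an (F1, F2) pair in Hz for one vowel.
    (x1, y1), (x2, y2), (x3, y3) = f_i, f_u, f_a
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2.0
```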

#22 Assessing agreement of observer- and self-annotations in spontaneous multimodal emotion data [PDF] [Copy] [Kimi]

Authors: Khiet P. Truong ; Mark A. Neerincx ; David A. van Leeuwen

We investigated inter-observer agreement and the reliability of self-reported emotion ratings (i.e., self-raters judging their own emotions) in spontaneous multimodal emotion data. During a multiplayer video game, vocal and facial expressions were recorded (including the game content itself) and were annotated by the players themselves on arousal and valence scales. In a perception experiment, observers rated a small part of the data that was provided in 4 conditions: audio only, visual only, audiovisual and audiovisual plus context. Inter-observer agreements varied between 0.32 and 0.52 when the ratings were scaled. Providing multimodal information usually increased agreement. Finally, we found that the averaged agreement between the self-rater and the observers was somewhat lower than the inter-observer agreement.

#23 Emotion recognition in spontaneous emotional speech for anonymity-protected voice chat systems [PDF] [Copy] [Kimi]

Authors: Yoshiko Arimoto ; Hiromi Kawatsu ; Sumio Ohno ; Hitoshi Iida

For the purpose of determining emotion recognition from acoustic information, we recorded natural dialogs between two or three players of online games to construct an emotional speech database. Two evaluators categorized the recorded utterances into emotions defined with reference to the eight primary emotions of Plutchik's three-dimensional circumplex model. Furthermore, 14 evaluators graded the utterances on a 5-point subjective evaluation scale to obtain reference degrees of emotion. Eleven acoustic features were extracted from the utterances, and analysis of variance (ANOVA) was conducted to assess significant differences between emotions. Based on the results of the ANOVA, we conducted discriminant analysis to discriminate each emotion from the others. Moreover, an experiment estimating emotional degree was conducted with multiple linear regression analysis to estimate the emotional degree of each utterance. As a result of the discriminant analysis, high correctness values of 79.12% for Surprise and 70.11% for Sadness were obtained, and over 60% correctness was obtained for most of the other emotions. As for emotional degree estimation, adjusted R-square (R2) values for each emotion ranged from 0.05 (Disgust) to 0.55 (Surprise) for closed sets, and root-mean-square (RMS) residual values for open sets ranged from 0.39 (Acceptance) to 0.59 (Anger).
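
A compact sketch of the two analyses, assuming scikit-learn, is shown below: a one-vs-rest linear discriminant for each emotion and a multiple linear regression for its degree; the feature handling and evaluation details are illustrative assumptions.

```python
"""Sketch of a one-vs-rest discriminant analysis and degree regression
per emotion; feature and evaluation details are assumptions."""
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LinearRegression


def analyse_emotion(features, labels, degrees, target_emotion):
    is_target = (labels == target_emotion).astype(int)
    lda = LinearDiscriminantAnalysis().fit(features, is_target)
    correctness = lda.score(features, is_target)      # closed-set accuracy
    reg = LinearRegression().fit(features, degrees)   # degree estimation
    return correctness, reg.score(features, degrees)  # accuracy, R^2
```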

#24 Assigning suitable phrasal tones and pitch accents by sensing affective information from text to synthesize human-like speech [PDF] [Copy] [Kimi]

Authors: Mostafa Al Masum Shaikh ; Md. Khademul Islam Molla ; Keikichi Hirose

We have carried out several perceptual and objective experiments showing that present Text-To-Speech (TTS) systems are weak in the relevance of prosody and segmental spectrum to the characterization and expression of emotions. Since it is known that the emotional state of a speaker usually alters the way s/he speaks, TTS systems need to be improved to generate human-like pitch accents that express the subtle features of emotions. This paper describes a pitch accent assignment technique which places appropriate pitch accents on elements of the utterance that require particular emphasis or stress. Our pitch accenting technique utilizes a commonsense knowledge base and a linguistic tool to recognize the emotion conveyed through the text itself. From these it determines whether the content of the utterance has a connotation of a particular emotion (e.g., happy, sad, surprise, etc.), good or bad concepts, praiseworthy or blameworthy actions, or common or vital information. It can then assign an appropriate pitch accent to one word in each prosodic phrase. The TTS component then determines the appropriate syllable to be accented in the word. Our approach can thus support a TTS system's synthesis, allowing the system to generate an affective version of the spoken text.

#25 Cross-language study of vocal correlates of affective states [PDF] [Copy] [Kimi]

Authors: Irena Yanushevskaya ; Ailbhe Ní Chasaide ; Christer Gobl

This paper is concerned with a cross-cultural study of vocal correlates of affect. Speakers of 4 languages, Irish-English, Russian, Spanish and Japanese, were asked to judge affective content of synthesised stimuli of three types: (1) stimuli varying in voice quality, with a neutral pitch contour; (2) stimuli with affect-related f0 contours and modal voice; and (3) stimuli in which specific voice qualities and affect-related f0 contours were combined. Some of the main results are illustrated and point to similarities among the language groups as well as some striking cross-language/culture differences in how these stimuli map to affect.