INTERSPEECH.2008 - Keynote

Total: 4

#1 In search of models in speech communication research [PDF] [Copy] [Kimi]

Author: Hiroya Fujisaki

This paper first presents the author's personal view on the importance of modeling in scientific research in general, and then describes two of his works toward modeling certain aspects of human speech communication. The first work is concerned with the physiological and physical mechanisms of controlling the voice fundamental frequency of speech, which is an important parameter for expressing information on tone, accent, and intonation. The second work is concerned with the cognitive processes involved in a discrimination test of speech stimuli, which gives rise to the phenomenon of so-called categorical perception. They are meant to illustrate the power of models based on deep understanding and precise formulation of the functions of the mechanisms/processes that underlie observed phenomena. Finally, it also presents the author's view on some models that are yet to be developed.

#2 Dealing with limited and noisy data in ASR: a hybrid knowledge-based and statistical approach [PDF] [Copy] [Kimi]

Author: Abeer Alwan

In this talk, I will focus on the importance of integrating knowledge of human speech production and speech perception mechanisms, and language-specific information with statistically-based, datadriven approaches to develop robust and scalable automatic speech recognition (ASR) systems. As we will demonstrate, the need for such hybrid systems is especially critical when the ASR system is dealing with noisy data, when adaptation data are limited (for the case of speaker normalization and adaptation), and when dealing with accents.

#3 Forensic automatic speaker recognition: fiction or science? [PDF] [Copy] [Kimi]

Author: Joaquin Gonzalez-Rodriguez

Hollywood films and CSI-like movies show a technology landscape far from real, both in forensic speaker recognition and other identification-of-the-source forensic areas. Lay persons are used to good-looking scientist-and-investigators performing voice identifications ("we got a match!") or smart fancy devices producing voice transformations causing one actor to instantaneously talk with the voice of other. Simultaneously, Forensic Identification Science is facing a global challenge impelled firstly by progressively higher requirements for admissibility of expert testimony in Court and secondly by the transparent and testable nature of DNA typing, which is now seen as the new gold-standard model of a scientifically defensible approach to be emulated by all other identification-of-the-source areas. In this presentation we will show how forensic speaker recognition can comply with the requirements of transparency and testability in forensic science This will lead to fulfilling the court requirements about role separation between scientists and judges/juries, and bring about integration in a forensically adequate framework in which the scientist provides the appropriate information necessary to the court's decision processes.

#4 Modelling rapport in embodied conversational agents [PDF] [Copy] [Kimi]

Author: Justine Cassell

In this talk I report on a series of studies that attempt to characterize the role of language and nonverbal behavior in relationship-building and rapport in humans, and then to use the results to implement embodied conversational agents capable of rapport with their users. In particular, we are implementing virtual survey interviewers that can use rapport to elicit truthful responses, and virtual direction-giving agents that behave differently as they give directions over the lifetime of use. We are implementing virtual peers that can engage in collaborative learning with children within different dialect communities, virtual peers that can scaffold the learning of rapport behaviors in children with autism spectrum disorder, and virtual peers that can be used to assess the social skills deficits of children with autism spectrum disorder so as to better plan their treatment. The goal of the research program is to better understand linguistic and nonverbal coordination devices from the utterance level to the relationship level: how they work in humans, how they can be modeled in virtual humans, and how virtual humans can be implemented to help humans have productive and satisfying relationships, with machines and with one another, over long perids of time