INTERSPEECH.2005 - Others

Total: 319

#1 The Blizzard Challenge - 2005: evaluating corpus-based speech synthesis on common datasets

Authors: Alan W. Black ; Keiichi Tokuda

In order to better understand different speech synthesis techniques on a common dataset, we devised a challenge to help us better compare research techniques for building corpus-based speech synthesizers. In 2004, we released the first two 1200-utterance single-speaker databases from the CMU ARCTIC speech databases and challenged research groups working in speech synthesis around the world to build their best voices from these databases. In January 2005, we released two further databases and a set of 50 utterance texts from each of five genres, and asked the participants to synthesize these utterances. The resulting synthesized utterances were then presented to three groups of listeners: speech experts, volunteers, and US English-speaking undergraduates. This paper summarizes the purpose, design, and overall process of the challenge.

#2 A probabilistic approach to unit selection for corpus-based speech synthesis

Authors: Shinsuke Sakai ; Han Shu

In this paper, we present a novel statistical approach to corpus-based speech synthesis. Unit selection is directed by probabilistic models for the F0 contour, duration, and spectral characteristics of the synthesis units. The F0 targets for units are modeled by statistical additive models, and duration targets are modeled by regression trees. Spectral targets for a unit are modeled by Gaussian mixtures on MFCC-based features, and the goodness of concatenation of two units is modeled by conditional Gaussian models on MFCC-based features. Although the system is in an early stage of development, we implemented an English speech synthesizer with the CMU ARCTIC corpora and confirmed the effectiveness of this new framework.
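As an illustration of how such probabilistic target and concatenation scores drive unit selection, here is a minimal Viterbi-style dynamic program over a candidate lattice. The scalar costs are hypothetical stand-ins for the negative log-probabilities of the paper's F0, duration, spectral, and join models; this is a sketch of the general technique, not the authors' implementation.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one unit per target position, minimizing the total
    target + concatenation cost via dynamic programming (Viterbi)."""
    # best[i][u] = (cumulative cost of ending at unit u, backpointer)
    best = [{u: (target_cost(targets[0], u), None) for u in candidates[0]}]
    for i in range(1, len(targets)):
        layer = {}
        for u in candidates[i]:
            # cheapest way to reach u from any unit in the previous layer
            c, p = min((best[i - 1][q][0] + join_cost(q, u), q)
                       for q in candidates[i - 1])
            layer[u] = (target_cost(targets[i], u) + c, p)
        best.append(layer)
    # backtrace from the cheapest final unit
    u, (cost, p) = min(best[-1].items(), key=lambda kv: kv[1][0])
    path = [u]
    for layer in reversed(best[:-1]):
        path.append(p)
        p = layer[p][1]
    path.reverse()
    return path, cost
```

Each layer keeps, for every candidate unit, the cheapest way to reach it; backtracing from the cheapest final unit recovers the selected sequence.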

#3 The Blizzard Challenge 2005 CMU entry - a method for improving speech synthesis systems

Authors: John Kominek ; Christina L. Bennett ; Brian Langner ; Arthur R. Toth

In CMU's Blizzard Challenge 2005 entry, we investigated twelve ideas for improving Festival-based unit selection voices. We tracked progress by adopting a three-tiered strategy in which candidate ideas must pass through three stages of listening tests to warrant inclusion in the final build. This allowed us to evaluate ideas consistently without large human resources at our disposal, and thereby to improve upon our baseline system in a short amount of time.

#4 Automatic personal synthetic voice construction

Authors: H. Timothy Bunnell ; Chris Pennington ; Debra Yarrington ; John Gray

We describe techniques used for automatic personal synthetic voice creation in our laboratory. These techniques are implemented in two pieces of software. One, called InvTool, guides novice users through the process of recording a corpus of speech appropriate for creating a concatenative synthetic voice. The other program, called BCC, compiles a speech corpus recorded with InvTool into a database suitable for use with the ModelTalker TTS system. Our primary goal in this project is to develop software to support "voice banking," wherein individuals at risk of losing the ability to speak can record their own personal synthetic voice for later use in voice output communication devices.

#5 An overview of the Nitech HMM-based speech synthesis system for Blizzard Challenge 2005

Authors: Heiga Zen ; Tomoki Toda

In the present paper, we describe the hidden Markov model (HMM) based speech synthesis system developed at the Nagoya Institute of Technology (Nitech-HTS) for the Blizzard Challenge 2005, a competition among text-to-speech synthesis systems built from the same speech databases. We give an overview of the basic HMM-based speech synthesis system and then illustrate recent developments in the latest version, such as STRAIGHT-based vocoding, hidden semi-Markov model (HSMM) based acoustic modeling, and parameter generation considering global variance. The constructed voices synthesize speech at around 0.3 xRT (real-time ratio), and their footprints are less than 2 MB. The listening test results show that our systems perform much better than we expected.

#6 On building a concatenative speech synthesis system from the Blizzard Challenge speech databases

Authors: Wael Hamza ; Raimo Bakis ; Zhi Wei Shuang ; Heiga Zen

In this paper, we compare two methods of building a concatenative speech synthesis system from the relatively small Blizzard Challenge speech databases. In the first method, we build a system directly from the Blizzard databases using the IBM Concatenative Speech Synthesis System, originally designed for very large speech databases. In the second method, a larger database is used to build the synthesis system and the output is "morphed" to match the speakers in the Blizzard databases. The second method outperformed the first while maintaining the identity of the Blizzard target speakers.

#7 Multisyn voices from ARCTIC data for the Blizzard Challenge

Authors: Robert A. J. Clark ; Korin Richmond ; Simon King

This paper describes the process of building unit selection voices for the Festival multisyn engine using four ARCTIC datasets, as part of the Blizzard evaluation challenge. The build procedure is almost entirely automatic, with very little need for human intervention. We discuss the difference in the evaluation results for each voice and evaluate the suitability of the ARCTIC datasets for building this type of voice.

#8 Large scale evaluation of corpus-based synthesizers: results and lessons from the Blizzard Challenge 2005

Author: Christina L. Bennett

The Blizzard Challenge 2005 was a large scale international evaluation of various corpus-based speech synthesis systems using common datasets. Six sites from around the world, both academic and industrial, participated in this evaluation, the first ever to compare voices built by different systems using the same data. Here we describe results of the evaluation and many of the observations and lessons discovered in carrying it out.

#9 Speech retrieval of Mandarin broadcast news via mobile devices

Authors: Berlin Chen ; Yi-Ting Chen ; Chih-Hao Chang ; Hung-Bin Chen

This paper presents a system for speech retrieval of Mandarin broadcast news. First, several data-driven and unsupervised approaches are integrated into the broadcast news transcription system to improve speech recognition accuracy and efficiency. Then, a multi-scale indexing paradigm for broadcast news retrieval is proposed to exploit the special structural properties of the Chinese language and to alleviate the problems caused by speech recognition errors. Finally, we use a PDA as the platform and Mandarin broadcast news stories collected in Taiwan as the document collection to establish a prototype speech-based multimedia information retrieval system. Very encouraging results are obtained.

#10 State estimation of meetings by information fusion using Bayesian network

Authors: Michiaki Katoh ; Kiyoshi Yamamoto ; Jun Ogata ; Takashi Yoshimura ; Futoshi Asano ; Hideki Asoh ; Nobuhiko Kitawaki

In this paper, a method is proposed for structuring the multimedia recording of a small meeting based on information such as sound source localization, multiple-talk detection, and the detection of non-speech sound events. The information from these detectors is fused by a Bayesian network to estimate the state of the meeting. Based on the estimated state, the recording of the meeting is structured using an XML-based description language and is visualized by a browser.
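The fusion step can be sketched under a strong conditional-independence assumption, i.e. a naive Bayes network rather than the paper's full network. The meeting states and detector likelihoods below are invented for illustration, not taken from the paper.

```python
def posterior_state(priors, likelihoods, observations):
    """Fuse independent detector outputs with a naive Bayes network:
    P(state | obs) is proportional to P(state) * prod_d P(obs_d | state)."""
    scores = {}
    for state, prior in priors.items():
        p = prior
        for detector, value in observations.items():
            p *= likelihoods[detector][state][value]
        scores[state] = p
    z = sum(scores.values())          # normalize to a proper distribution
    return {s: v / z for s, v in scores.items()}
```

For example, a "multiple talkers active" detection shifts the posterior from "presentation" toward "discussion"; extra detectors simply multiply in further likelihood terms.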

#11 Results from a survey of attendees at ASRU 1997 and 2003

Author: Roger K. Moore

In 1997, the author conducted a survey at the IEEE workshop on 'Automatic Speech Recognition and Understanding' (ASRU) in which attendees were offered a set of twelve putative future events and asked to assign a date to each. Six years later, at ASRU 2003, the author repeated the survey with eight further items added. This paper presents the combined results from both surveys.

#12 Speech processing in the networked home environment - a view on the Amigo project

Authors: Reinhold Haeb-Umbach ; Basilis Kladis ; Joerg Schmalenstroeer

Full interoperability of networked devices in the home has been an elusive concept for many years. Amigo, an Integrated Project within the EU 6th Framework Programme, aims to make home networking a reality by addressing two key issues. First, it brings together many major players in the domestic appliances, communications, consumer electronics, and computer industries to develop a common open-source middleware platform. Second, emphasis is placed on the development of intelligent user services that make the benefit of a networked home environment tangible for the end user. This paper shows how speech processing can contribute to this second goal of user-friendly, personalized, context-aware services.

#13 Fixed distortion segmentation in efficient sound segment searching

Author: Masahide Sugiyama

Searching for a query signal within a stored signal is formulated as a segment searching problem in which the signal is converted into a sequence of feature vectors. As an efficient segment searching algorithm, a new method for pruning candidates in the segment sequence has been proposed, and its effectiveness has been shown through experimental results. The proposed searching algorithm is 20-30 times faster than the conventional Active Search algorithm. As the first step of the proposed method, distortion-based segmentation is carried out. Since the searching criterion is based on the l1 norm, the segmentation is expected to be carried out using an l1 criterion as well. This paper compares two segmentation methods: maximum l1 distortion segmentation and average l2 distortion segmentation. The average l2 distortion segmentation is very efficient; on the other hand, the maximum l1 distortion segmentation does not require radius information. The experimental results show that the two methods have almost equal segment searching performance when the number of segments is the same.
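The general idea of distortion-based segment searching with candidate pruning can be sketched in a deliberately simplified form: a sliding l1-distance search that abandons a candidate window as soon as its partial cost exceeds the best complete cost found so far. This illustrates the pruning principle only, not the paper's proposed algorithm.

```python
def search_segment(query, stored):
    """Find the start index in `stored` whose length-m window best matches
    `query` under the l1 norm, abandoning hopeless candidates early."""
    best_idx, best_cost = -1, float("inf")
    m = len(query)
    for start in range(len(stored) - m + 1):
        cost = 0.0
        for q, s in zip(query, stored[start:start + m]):
            cost += abs(q - s)
            if cost >= best_cost:      # prune: this window can no longer win
                break
        else:
            best_idx, best_cost = start, cost
    return best_idx, best_cost
```

Because the l1 cost only grows as the window is scanned, the early break is safe: a pruned window could never have beaten the current best.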

#14 Identifying singers of popular songs

Authors: Tin Lay Nwe ; Haizhou Li

In this paper, we propose to identify the singers of popular songs using vibrato characteristics and high-level musical knowledge of song structure. The proposed framework starts with a vocal detection process followed by a hypothesis test for vocal/non-vocal verification. This method allows us to select vocal segments of high confidence for singer identification. From the selected vocal segments, cepstral coefficients reflecting the vibrato characteristics are computed using parabolic bandpass filters spread according to the musical frequency scale. The strategy in our classifier formulation is to utilize high-level musical knowledge of song structure in singer modeling. The proposed framework is validated on a database containing 84 popular songs from commercially available CDs by 12 singers. We achieve an average error rate of 17.9% in segment-level identification.

#15 Speech repair: quick error correction just by using selection operation for speech input interfaces

Authors: Jun Ogata ; Masataka Goto

In this paper, we propose a novel speech input interface function, called "Speech Repair," in which recognition errors can be easily corrected by selecting candidates. During speech input, this function displays not only the usual speech recognition result but also other competitive candidates. Each word in the result is delimited by line segments and accompanied by other word candidates, and a user who finds a recognition error can simply select the correct word from the candidates for that temporal region. To overcome the difficulty of generating appropriate candidates, we adopted a confusion network that condenses the huge internal word graph of a large vocabulary continuous speech recognition (LVCSR) system. In our experiments, almost all recognition errors were corrected, confirming the effectiveness of speech repair.
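The correction mechanism rests on the confusion network's slot structure: each temporal region carries a ranked set of word candidates, and repair replaces the top candidate with the user's selection. A minimal sketch follows; the slot dictionaries and scores are invented for illustration, not the authors' data format.

```python
def best_path(confusion_net):
    """Default recognition output: the top-scoring candidate in each slot."""
    return [max(slot, key=slot.get) for slot in confusion_net]

def repair(confusion_net, corrections):
    """Apply user selections, given as a mapping slot index -> chosen word.
    Selections must come from the slot's own candidate set."""
    words = best_path(confusion_net)
    for idx, word in corrections.items():
        if word not in confusion_net[idx]:
            raise ValueError("selection must be one of the slot's candidates")
        words[idx] = word
    return words
```

The user never types: fixing an error is a single selection from the competing candidates already displayed for that region.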

#16 Steerable highly directional audio beam loudspeaker

Authors: Dirk Olszewski ; Fransiskus Prasetyo ; Klaus Linhard

This paper presents a method of steering audible sound beams generated by parametric arrays in air. A hybrid system combining electronic and mechanical phased-array techniques is used. Although commercially available emitter technology was used, and therefore several guidelines known from prior-art phased arrays could not be followed, sufficient beam steering performance has been achieved.

#17 Automatic music genre classification using second-order statistical measures for the prescriptive approach

Authors: Hassan Ezzaidi ; Jean Rouat

Several works proposed for automatic musical genre classification are based on various combinations of parameters, exploiting different models. However, comparison of all previous works remains impossible, since they used different target taxonomies, genre definitions, and databases. In this paper, the world's largest music database (Real World Computing) is used. In addition, different measures related to second-order statistics are investigated for genre classification. Various strategies are proposed for the training and testing sessions, such as matched conditions, mismatched conditions, long training/testing, and long training with short testing. For all experiments, the section of the file used in testing was never presented during the training session. The best classifier achieved 97% and 69% performance under matched and mismatched conditions, respectively.

#18 Effect of head orientation on the speaker localization performance in smart-room environment

Authors: Alberto Abad ; Dusan Macho ; Carlos Segura ; Javier Hernando ; Climent Nadeu

Reliable measures of speaker positions are needed for computational perception of human activities taking place in a smart-room environment. In this work, we investigate the effect of the talker's head orientation on the accuracy of acoustic source localization techniques, and its relation to the talker's directivity pattern and room reverberation. Two representative speaker localization techniques are assessed, steered response power and a crossing-lines based method, in both cases on the basis of delays between microphone pairs estimated with the GCC-PHAT algorithm. A small database has been collected at UPC's smart room for evaluation. The results show that the localization error depends heavily on head orientation, and that the space-exploration based technique is much more robust to head orientation changes than the crossing-lines technique, due to the way contributions from the various microphones are combined.
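GCC-PHAT itself is standard: the cross-power spectrum of two microphone signals is whitened by its magnitude before inverse transformation, so the peak of the resulting correlation gives the inter-microphone delay. Below is a minimal pure-Python sketch of the estimator, using a naive O(n^2) DFT to stay self-contained; a real system would use an FFT and interpolate the peak.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)); fine for short sketches."""
    n = len(x)
    s = 2j if inverse else -2j
    out = [sum(x[k] * cmath.exp(s * cmath.pi * i * k / n) for k in range(n))
           for i in range(n)]
    return [v / n for v in out] if inverse else out

def gcc_phat_delay(sig, ref):
    """Estimate the delay (in samples) of `sig` relative to `ref` using
    generalized cross-correlation with the phase transform (GCC-PHAT)."""
    n = len(sig)
    X, Y = dft(sig), dft(ref)
    cross = [x * y.conjugate() for x, y in zip(X, Y)]
    # PHAT weighting: keep only phase by whitening with the magnitude
    cross = [c / (abs(c) or 1.0) for c in cross]
    r = [v.real for v in dft(cross, inverse=True)]
    shift = max(range(n), key=r.__getitem__)
    return shift if shift <= n // 2 else shift - n   # wrap to signed delay
```

The delay between each microphone pair, scaled by the speed of sound, constrains the source to a hyperboloid; combining several pairs yields the position estimate that the two techniques in the paper derive in different ways.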

#19 Application of automatic speaker recognition techniques to pathological voice assessment (dysphonia)

Authors: Corinne Fredouille ; G. Pouchoulin ; Jean-François Bonastre ; M. Azzarello ; A. Giovanni ; A. Ghio

This paper investigates the adaptation of Automatic Speaker Recognition (ASR) techniques to pathological voice assessment (dysphonic voices). The aim of this study is to provide a novel method, suitable for keeping track of the evolution of the patient's pathology, that is easy to use, fast, non-invasive for the patient, and affordable for clinicians. This method is complementary to the existing ones - perceptual judgment and the usual objective measurements (jitter, airflow, etc.) - which remain time- and human-resource-consuming.

#20 Adaptive speech analytics: system, infrastructure, and behavior

Authors: Upendra V. Chaudhari ; Ganesh N. Ramaswamy ; Eddie Epstein ; Sasha P. Caskey ; Mohamed Kamal Omar

This paper describes an adaptive system and infrastructure for speech analytics, based on the UIMA framework and consisting of a set of analysis engines (analytics) and control units. Its input is an unspecified and ever-changing number of continuous streams of audio data; its output is the detection of events consistent with a focus of analysis and/or the discovery of relationships among the outputs of the constituent analytics in the system. The central theme concerns the ability of the system to use the meta-data generated during analysis to adapt both the behavior of the underlying analytics engines and the overall data flow, adjusting the granularity and accuracy of the analysis so as to process increasing amounts of data with limited resources.

#21 Lexical tone perception in musicians and non-musicians

Authors: Jennifer A. Alexander ; Patrick C. M. Wong ; Ann R. Bradlow

It has been suggested that music and speech maintain entirely dissociable mental processing systems. The current study, however, provides evidence that there is an overlap in the processing of certain shared aspects of the two. This study focuses on fundamental frequency (pitch), which is an essential component of melodic units in music and lexical and/or intonational units in speech. We hypothesize that extensive experience with the processing of musical pitch can transfer to a lexical pitch-processing domain. To that end, we asked nine English-speaking musicians and nine English-speaking non-musicians to identify and discriminate the four lexical tones of Mandarin Chinese. The subjects performed significantly differently on both tasks; the musicians identified the tones with 89% accuracy and discriminated them with 87% accuracy, while the non-musicians identified them with only 69% accuracy and discriminated them with 71% accuracy. These results provide counter-evidence to the theory of dissociation between music and speech processing.

#22 Contextual effect on perception of lexical tones in Cantonese

Authors: Joan K.-Y. Ma ; Valter Ciocca ; Tara Whitehill

The present study investigated the role of tonal context (extrinsic information) in the perception of Cantonese lexical tones. Target tones at three positions (initial, medial, and final) were recorded by two speakers (one male and one female). These sentences were edited and presented in three conditions: original carrier (target within the original context), isolation (target without context), and neutral carrier (target word appended at the final position within a new carrier). Nine female listeners were asked to identify the tones by matching targets with Chinese characters. Perceptual data showed that tones presented within the original carrier were more accurately perceived than targets presented in isolation, showing the importance of extrinsic information in the perception of lexical tones. In the neutral carrier condition, tones at the final position were perceived significantly more accurately than targets at the initial and medial positions. The perceptual error patterns suggested that listeners placed more emphasis on the immediate context preceding the target in tone identification. When tones were presented without an extrinsic context, the proportion of errors for each tone differed; most errors involved misidentifying targets as tones with the same F0 contour but a different level. The results showed that extrinsic information matters mainly for the identification of F0 level, while the intrinsic acoustic properties of the tone help in identifying the F0 contour.

#23 Visual cues in Mandarin tone perception

Authors: Hansjörg Mixdorff ; Yu Hu ; Denis Burnham

This paper presents results concerning the exploitation of visual cues in the perception of Mandarin tones. The lower part of a female speaker's face was recorded on digital video as she uttered 25 sets of syllabic tokens covering the four tones of Mandarin. In a perception study, the audio soundtrack alone, as well as an audio-plus-video condition, was presented to native Mandarin speakers, who were required to decide which tone they perceived. Audio was presented in various conditions: clear, babble-noise masked at different SNR levels, as well as devoiced and amplitude-modulated noise conditions using LPC resynthesis. In the devoiced and clear audio conditions, there is little augmentation over audio alone due to the addition of video. However, the addition of visual information did significantly improve perception in the babble-noise masked condition, and this effect increased with decreasing SNR. This outcome suggests that the improvement in noise-masked conditions is not due to additional information in the video per se, but rather to an effect of early integration of acoustic and visual cues facilitating auditory-visual speech perception.

#24 Cross-language perception of word stress

Authors: Hansjörg Mixdorff ; Yu Hu

This paper presents a study of the perception of Mandarin disyllabic words by native speakers of German. It examines how speakers of an accent language perceive word stress in words from a tone language. A corpus of 15 sets of words with all possible combinations of the four tones of Mandarin was recorded by a professional speaker. In addition, monotonized versions of the words were created. In a forced-choice listening experiment, native speakers of German were asked to assess whether they perceived the word stress on the first or second syllable. Results show, inter alia, that words with two high tones, as well as the monotonized stimuli, were predominantly perceived as carrying word stress on the first syllable. Words with a falling tone on the second syllable were mostly classified as carrying stress on the second syllable, with the combination of low and falling tone yielding the highest score. Many combinations of tones, however, could not be identified as either of the two kinds. This suggests that although some tonal configurations in Mandarin are similar to German two-syllable word accent patterns and can be associated with the latter, others might rather be interpreted as pertaining to two monosyllabic words, both of which are stressed.

#25 The lexical statistics of word recognition problems caused by L2 phonetic confusion

Author: Anne Cutler

Phonemic confusions in L2 listening lead to three types of problem at the lexical level: inability to distinguish minimal pairs (e.g. write, light), spurious activation of embedded words (e.g. write in delighted), and delay in the resolution of ambiguity (e.g. distinction between register and legislate at the sixth instead of the first phoneme). The statistics of each of these, computed from a 70,000+ word English lexicon backed by frequency statistics from a 17.9-million-word corpus, establish that each causes substantial added difficulty for L2 listeners.
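The second problem type, spurious activation of embedded words, can be illustrated with a toy sketch that collapses confusable symbols before substring matching. Orthography stands in here for the phoneme strings the study actually computes over, and the r/l confusion map is the abstract's own example.

```python
def spurious_embeddings(word, lexicon, confusable):
    """Find lexicon words that embed `word` once confusable symbols
    (e.g. r and l for some L2 listeners) are collapsed to one symbol."""
    def collapse(s):
        return "".join(confusable.get(ch, ch) for ch in s)
    target = collapse(word)
    return [w for w in lexicon if target in collapse(w)]
```

For a listener who cannot distinguish r from l, "right" collapses to "light" and is therefore spuriously activated inside "delighted", exactly the embedding effect the lexical statistics quantify.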