This paper presents the virtual speech cuer built in the context of the ARTUS project, which aims at watermarking hand and face gestures of a virtual animated agent in a broadcast audiovisual sequence. For deaf televiewers who master cued speech, the animated agent can then be superimposed, on demand and at the receiving end, on the original broadcast as an alternative to subtitling. The paper presents the multimodal text-to-speech synthesis system and a first evaluation performed by deaf users.
One use of text-to-speech synthesis (TTS) is as a component of speech-to-speech translation systems. The output of automatic machine translation (MT), however, can vary widely in quality. A synthetic voice that is highly intelligible on naturally occurring text may be far less intelligible when asked to render automatically generated text. In this paper, we compare the quality of synthesis of naturally occurring text and its MT counterpart. We find that the intelligibility of TTS on MT output is significantly lower than on either naturally occurring text or semantically unpredictable sentences, and we explore the reasons why.
This paper describes a technique for controlling the voice quality of synthetic speech using a multiple-regression hidden semi-Markov model (HSMM). In this technique, the mean vectors of the output and state-duration distributions of the HSMM are modeled by multiple regression on a parameter vector called the voice-quality control vector. We first choose three features for controlling voice quality, namely "smooth voice - non-smooth voice," "warm - cold," and "high-pitched - low-pitched," and then attempt to control the voice quality of synthetic speech along these features. Results of several subjective tests show that the proposed technique can change these features of voice quality intuitively.
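As a rough sketch of the regression idea, the formulation below assumes the standard multiple-regression HSMM setup; the symbol names (H_i, xi, v) are illustrative and not taken from the paper:

```latex
% Mean vectors expressed as multiple regression on the voice-quality
% control vector v = (v_1, ..., v_L); here L = 3 for the three feature
% pairs. H_i and h_i^{(d)} are per-state regression parameters
% (illustrative names).
\[
  \boldsymbol{\mu}_i = \mathbf{H}_i\,\boldsymbol{\xi},
  \qquad
  m_i^{(d)} = \mathbf{h}_i^{(d)}\,\boldsymbol{\xi},
  \qquad
  \boldsymbol{\xi} = \bigl[\,1,\; v_1,\; \dots,\; v_L\,\bigr]^{\top},
\]
```

Under this form, sliding one component of v linearly shifts the output means (and the state-duration means) in the corresponding direction, which is what allows a perceptual axis such as "high-pitched - low-pitched" to be controlled with a single scalar.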
In speech technology it is very important to have a system capable of accurately performing grapheme-to-phoneme (G2P) conversion. This is not an easy task, especially for languages such as English, where there is no obvious letter-to-phone correspondence. The manually written rules so widely used before are now giving way to machine-learning techniques and language-independent tools. In this paper we present an extension of the transformation-based error-driven algorithm to the G2P task. A set of explicit rules was inferred to correct the pronunciation for U.S. English, Spanish, and Catalan, using well-known machine-learning techniques in combination with the transformation-based algorithm. All methods applied in combination with transformation rules significantly outperform the results obtained by these methods alone.
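To make the transformation-based correction step concrete, here is a minimal sketch assuming a one-phone-per-letter alignment and a toy rule template; apply_rules, the rule format, and the example rules are all hypothetical illustrations, not the rules inferred in the paper:

```python
# Minimal sketch of transformation-based error-driven correction for G2P.
# The rule template, the baseline pronunciation, and the rules themselves
# are toy examples; real systems learn rules greedily from training errors.

from typing import List, Tuple

# A rule rewrites one phone into another when the neighbouring letters match:
# (left_letter, letter, right_letter, old_phone, new_phone)
Rule = Tuple[str, str, str, str, str]

def apply_rules(letters: List[str], phones: List[str],
                rules: List[Rule]) -> List[str]:
    """Apply an ordered list of transformation rules to a baseline phone string."""
    padded = ["#"] + letters + ["#"]           # word-boundary padding
    out = list(phones)
    for left, mid, right, old, new in rules:   # rules fire in learned order
        for i in range(len(letters)):
            if (out[i] == old and padded[i] == left
                    and letters[i] == mid and padded[i + 2] == right):
                out[i] = new
    return out

# Toy example: a naive letter-to-phone baseline mispronounces "ph";
# two learned rules correct 'p' -> 'f' and silence the following 'h'.
letters = list("phone")
baseline = ["p", "h", "o", "n", "e"]           # one phone per letter (baseline guess)
rules = [("#", "p", "h", "p", "f"),            # p -> f at word start before 'h'
         ("p", "h", "o", "h", "_")]            # silence the 'h' after 'p'
print(apply_rules(letters, baseline, rules))   # ['f', '_', 'o', 'n', 'e']
```

In the error-driven learning phase, candidate rules instantiated from such templates are scored on training data, and the rule that most reduces the remaining pronunciation errors is appended to the ordered list; the sketch above only shows the application phase.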
This paper describes a novel framework of voice conversion (VC), which we call eigenvoice conversion (EVC). We apply EVC to the conversion from a source speaker's voice to arbitrary target speakers' voices. Using multiple parallel data sets consisting of utterance pairs of the source speaker and multiple pre-stored target speakers, a canonical eigenvoice GMM (EV-GMM) is trained in advance. This conversion model enables us to flexibly control the speaker individuality of the converted speech by manually setting weight parameters. In addition, the optimum weight set for a specific target speaker can be estimated using only speech data of the target speaker, without any linguistic restrictions. We evaluate the performance of EVC with a spectral distortion measure. Experimental results demonstrate that EVC works well even when only a few utterances of the target speaker are used for the weight estimation.
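A sketch of the eigenvoice parameterization behind this, assuming the usual eigenvoice-style GMM formulation; the symbols (B_i, b_i, w, lambda) are illustrative names, not the paper's notation:

```latex
% Target-side mean of the i-th mixture as a bias plus a weighted sum of
% eigenvectors, so a low-dimensional weight vector w carries the speaker
% identity; w is then fit by maximum likelihood on target frames y_t alone.
\[
  \boldsymbol{\mu}^{(Y)}_i(\mathbf{w})
    = \mathbf{B}_i\,\mathbf{w} + \mathbf{b}^{(0)}_i ,
  \qquad
  \hat{\mathbf{w}} = \arg\max_{\mathbf{w}}
      \prod_{t} p\!\left(\mathbf{y}_t \mid \mathbf{w}, \lambda\right).
\]
```

Because the likelihood in the second expression depends only on the target speaker's acoustic frames, no parallel data or transcriptions are needed for adaptation, which is consistent with the abstract's claim that a few unrestricted utterances suffice for the weight estimation.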
Presenting complex information in an understandable manner using speech is a challenging task to do well. Significant limitations, both in the generation process and in human listeners' capabilities, typically make for poorly understood speech. This work examines possible strategies for producing understandable spoken renderings of complex information within those limitations, as well as identifying ways to improve systems so as to reduce the limitations' impact. We discuss a simple user study that explores these strategies with complex structured information, and describe a spoken dialog system that will make use of this work to provide a more understandable speech interface to structured information.