This paper describes an improved method for the structure-to-speech conversion framework we previously proposed. Most speech synthesizers take a phoneme sequence as input and generate speech by converting each phoneme into its corresponding sound. In other words, they simulate the human process of reading text aloud. However, infants usually acquire the ability to communicate by speech without text or phoneme sequences. Since their phonemic awareness is very immature, they can hardly decompose an utterance into a sequence of phones or phonemes. According to developmental psychology, infants acquire the holistic sound patterns of words, called the word Gestalt, from the utterances of their parents and reproduce them with their own vocal tubes. This behavior is called vocal imitation. In our previous studies, the word Gestalt was defined physically and a method of extracting it from a word utterance was proposed. We have already applied the word Gestalt to ASR, CALL, and speech generation, the last of which we call structure-to-speech conversion. Unlike reading machines, our framework simulates infants' vocal imitation. In this paper, a method for improving our speech generation framework based on a structural cost function is proposed and evaluated.
In this paper, we present our efforts towards deriving vocal tract shapes from ElectroMagnetic Articulograph (EMA) data via geometric adaptation and matching. We describe a novel approach for adapting Maeda's geometric model of the vocal tract to one speaker in the MOCHA database. We show how we can rely solely on the EMA data for adaptation. We present our search technique for the vocal tract shapes that best fit the given EMA data. We then describe our approach to synthesizing speech from these shapes. Mel-cepstral distortion results show an improvement in synthesis over our previous approach, which did not use adaptation.
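The Mel-cepstral distortion mentioned above can be illustrated with a minimal sketch. The abstract does not say which variant of the measure was used, so the commonly used definition in dB (averaged over time-aligned frames, with coefficient 0 skipped) is an assumption here, as is the function name `mel_cepstral_distortion`.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean Mel-cepstral distortion in dB over time-aligned frames.

    Assumption: each frame is a sequence of Mel-cepstral coefficients and
    coefficient 0 (overall energy) is excluded from the distance.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need two non-empty, equally long frame sequences")

    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("frames must have the same number of coefficients")
        # Squared Euclidean distance over coefficients 1..D.
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref_frames)
```

A lower value means the synthesized cepstra are closer to the reference, which is the sense in which a reduction counts as an improvement.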
As part of ongoing research towards integrating an articulatory synthesizer into a text-to-speech (TTS) framework, a corpus of German utterances recorded with electromagnetic articulography (EMA) is resynthesized to provide training data for statistical models. The resynthesis is based on a measure of similarity between the original and resynthesized EMA trajectories, weighted by articulatory relevance. Preliminary results are discussed and future work outlined.
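One way to picture a similarity measure weighted by articulatory relevance is a weighted average of per-channel errors. The sketch below is an assumption for illustration only: the abstract does not give the actual measure, and the channel names, weights, and function name are hypothetical.

```python
import math


def weighted_trajectory_error(original, resynth, weights):
    """Weighted mean of per-channel RMS errors (lower means more similar).

    original, resynth: dict mapping a channel name to a list of samples.
    weights: dict mapping a channel name to a non-negative relevance weight.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    error = 0.0
    for channel, weight in weights.items():
        orig, res = original[channel], resynth[channel]
        if not orig or len(orig) != len(res):
            raise ValueError(f"channel {channel!r} needs equal, non-empty lengths")
        rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(orig, res)) / len(orig))
        error += weight * rmse
    return error / total_weight
```

With a higher weight on a channel that matters for a given sound, a mismatch there is penalized more than the same mismatch on a less relevant channel.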
We present a new speech synthesizer class, named KlattGrid, for the Praat program [3]. This synthesizer is based on the original description of Klatt [1, 2]. The new aspects of a KlattGrid in comparison with other Klatt-type synthesizers are as follows. A KlattGrid is not frame-based but time-based: you specify parameters as a function of time with any precision you like. It has no limitations on the number of oral formants, nasal formants, nasal antiformants, tracheal formants or tracheal antiformants that can be defined. It has separate formants for the frication part. It allows varying the form of the glottal flow function as a function of time. It allows for any number of formants and bandwidths to be modified during the open phase of the glottis. It uses no beforehand quantization of amplitude parameters. It is fully integrated into the freely available speech analysis program Praat [3].
This work provides a method for building an English text-to-speech (TTS) system for a population that speaks a dialect which is undefined and for which no resources exist, by showing how a TTS system was developed for the English dialect spoken in Kenya. To begin with, the existence of a unique, previously undefined English dialect was confirmed by the need of the English-speaking Kenyan population for a TTS system with an accent different from the British accent. This dialect is referred to here, and has also been branded, as Kenyan English. Given that building a TTS system requires language features to be adequately defined, it was necessary to develop the essential features of the dialect, such as the phoneset and the lexicon, and then to verify their correctness. The paper shows how it was possible to come up with a systematic approach for defining these features by tracing the evolution of the dialect. It also discusses how the TTS system was built and tested.
We propose a method for concatenative speech synthesis that achieves a better match between the logF0 and duration predicted by the prosody module and the waveform generation back-end. The proposed method is based upon our previous multilevel parametric F0 model and Toshiba's plural unit selection and fusion synthesizer. The method adds a feedback loop from the back-end into the prosody module so that the prosodic information of the selected units is used to re-estimate new prosody values. The feedback loop defines a frame-level prosody model which consists of the average value and variance of the duration and logF0 of the selected units. The log-likelihood defined by this model is added to the log-likelihood of the prosody model. By maximizing this total log-likelihood, we obtain the prosody values that produce the optimum compromise between the distortion introduced by F0 discontinuities and the distortion created by the prosody-adjusting signal processing.
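As a rough illustration of maximizing a sum of two log-likelihoods, the sketch below assumes that both the prosody model and the model derived from the selected units are single Gaussians over one value, for example the logF0 of one frame. The prosody model in the paper is a multilevel parametric model, so this is a deliberate simplification, and all function names are hypothetical.

```python
import math


def gaussian_log_likelihood(x, mean, variance):
    """Log-density of a one-dimensional Gaussian at x."""
    return -0.5 * (math.log(2.0 * math.pi * variance) + (x - mean) ** 2 / variance)


def total_log_likelihood(x, prosody, units):
    """Sum of the two log-likelihoods; each model is a (mean, variance) pair."""
    return gaussian_log_likelihood(x, *prosody) + gaussian_log_likelihood(x, *units)


def best_prosody_value(prosody, units):
    """Value maximizing the total log-likelihood.

    For two Gaussians the maximizer is the precision-weighted mean, so the
    model with the smaller variance pulls the result more strongly.
    """
    (mean_p, var_p), (mean_u, var_u) = prosody, units
    precision = 1.0 / var_p + 1.0 / var_u
    return (mean_p / var_p + mean_u / var_u) / precision
```

Under these assumptions, a small variance among the selected units pulls the re-estimated value towards what the units already provide, which reduces the amount of signal modification needed.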
This paper describes work in progress on the adequate modeling of fast speech in unit selection speech synthesis systems, primarily with blind and visually impaired users in mind. First, a survey of the main characteristics of fast speech is given. Then, strategies for fast speech production are discussed, and requirements are derived concerning the abilities of a speaker for a fast-speech unit selection inventory. The following section deals with a perception study in which a selected speaker's ability to speak fast is investigated. To conclude, a preliminary perceptual analysis of the recordings for the speech synthesis corpus is presented.
Synthesized speech can be severely degraded in noise, resulting in compromised speech quality. In this paper, we propose a unit-selection-based speech synthesis system for better speech quality under poor channel conditions. First, a measure of speech intelligibility is incorporated in the cost function as a search criterion for unit selection. Next, the prosody of the selected units is modified according to the Lombard effect. Prosody modification includes increasing the amplitude of unvoiced phonemes and lengthening the speech duration. Finally, FIR equalization via convex optimization is applied to reduce signal distortion due to the channel effect. A listening test in our experiments shows that the quality of synthetic speech can be improved under poor channel conditions with the help of our proposed synthesis system.
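The prosody modification step can be pictured as a simple transformation of per-unit targets. The sketch below is a minimal illustration under stated assumptions: the unit representation, the function name, and the default gain and duration factors are all hypothetical, and the actual signal processing needed to realize the new durations is not shown.

```python
def apply_lombard_prosody(units, unvoiced_gain=1.5, duration_scale=1.2):
    """Return new unit targets with louder unvoiced phonemes and longer durations.

    units: list of dicts with keys "phoneme", "voiced", "amplitude", "duration_ms".
    unvoiced_gain and duration_scale are illustrative values, not from the paper.
    """
    modified = []
    for unit in units:
        # Only unvoiced phonemes receive the amplitude boost.
        gain = 1.0 if unit["voiced"] else unvoiced_gain
        modified.append({
            "phoneme": unit["phoneme"],
            "voiced": unit["voiced"],
            "amplitude": unit["amplitude"] * gain,
            # Every unit is lengthened by the same factor.
            "duration_ms": unit["duration_ms"] * duration_scale,
        })
    return modified
```

The input list is left unchanged, so the original and modified targets can be compared side by side.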
Unit selection text-to-speech systems currently produce very natural synthetic sentences by concatenating speech segments from a large database. Recently, increasing demand for high-quality voices built from less data has created a need for further optimization of the textual corpus recorded by the speaker. The optimization of this corpus is traditionally guided by the coverage rate of well-known units: triphones, words. Such units, however, are not specific to concatenative speech synthesis; they are in general use in speech technologies and linguistics. In this paper, we describe a new unit which takes into account the specific features of concatenative TTS: the "vocalic sandwich." Both an objective and a perceptual evaluation tend to show that vocalic sandwiches are appropriate units for corpus design.
For speech synthesizers, enhanced diversity and improved quality of synthesized speech are required. Speaker interpolation and voice conversion are techniques that enhance diversity. The PUSF (plural unit selection and fusion) method, which we have proposed, generates synthesized waveforms using pitch-cycle waveforms. However, it is difficult to modify its spectral features while preserving the naturalness of the synthesized speech. In the present work, we investigated how best to represent speech waveforms. First, we introduce a method that decomposes a pitch waveform in a voiced portion into a periodic component, which is excited by the vocal sound source, and an aperiodic component, which is excited by a noise source. Second, we introduce the FWF (formant waveform) model to represent the periodic component. Because the FWF model represents the pitch waveform in terms of formant parameters, it can control the formant parameters independently. Because this model is based on vocal tract features, the resulting method can easily be applied to the diversity-enhancing techniques in the PUSF-based method.
In speech synthesis, the unit inventory is decided using phonological and phonetic expertise. This process is resource intensive and potentially sub-optimal. In this paper we investigate how acoustic clustering, together with lexicon constraints, can be used to build a self-organised inventory. Six English speech synthesis systems were built using two frameworks, unit selection and parametric HTS, for three inventory conditions: 1) a traditional phone set, 2) a system using orthographic units, and 3) a self-organised inventory. A listening test showed a strong preference for the classic system, and for the orthographic system over the self-organised system. Results also varied by letter-to-sound complexity and database coverage. This suggests that the self-organised approach both failed to generalise pronunciation and introduced noise above and beyond that caused by orthographic sound mismatch.
This paper proposes a context-dependent additive acoustic modelling technique and its application to logarithmic fundamental frequency (log F0) modelling for HMM-based speech synthesis. In the proposed technique, mean vectors of state-output distributions are composed as the weighted sum of decision tree-clustered context-dependent bias terms. Its model parameters and decision trees are estimated and built based on the maximum likelihood (ML) criterion. The proposed technique has the potential to capture the additive structure of log F0 contours. A preliminary experiment using a small database showed that the proposed technique yielded encouraging results.
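The idea of composing a mean as a weighted sum of context-dependent bias terms can be sketched in a few lines. Everything in the example is hypothetical: the two hand-written "trees", the context keys, the bias values, and the weights stand in for decision trees and parameters that the paper estimates with the ML criterion.

```python
def phrase_tree(context):
    """Toy stand-in for a decision tree: returns the bias of the matching leaf."""
    return 5.0 if context["phrase_position"] == "initial" else 4.8


def accent_tree(context):
    """Second toy tree, modelling an accent-related offset on log F0."""
    return 0.3 if context["accented"] else 0.0


def additive_mean(context, components):
    """Mean of a state-output distribution as a weighted sum of bias terms.

    components: list of (weight, tree) pairs, where tree(context) returns the
    bias term of the leaf that the context falls into.
    """
    return sum(weight * tree(context) for weight, tree in components)
```

For instance, `additive_mean({"phrase_position": "initial", "accented": True}, [(1.0, phrase_tree), (1.0, accent_tree)])` adds the accent offset on top of the phrase-level value, so each component can capture a different part of the contour.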