INTERSPEECH.2012 - Speech Synthesis

| Total: 44

#1 Synthetic F0 can effectively convey speaker ID in delexicalized speech [PDF] [Copy] [Kimi1] [REL]

Authors: Eric Morley, Esther Klabbers, Jan P. H. van Santen, Alexander Kain, Seyed Hamidreza Mohammadi

We investigate the extent to which F0 can convey speaker ID in the absence of spectral, segmental, and durational information. We propose two methods of F0 synthesis based on the Linear Alignment Model (LAM, van Santen 2000): one parametric, the other corpus-based. Through a perceptual experiment, we show that F0 alone is able to convey information about speaker ID. We find that F0 synthesized with either LAM-based method conveys speaker ID almost as effectively as natural F0.


#2 Evaluating prosodic processing for incremental speech synthesis [PDF1] [Copy] [Kimi1] [REL]

Authors: Timo Baumann, David Schlangen

Incremental speech synthesis (iSS) accepts input and produces output in consecutive chunks that only together result in a full utterance. Systems that use iSS thus have the ability to adapt their utterances while they are ongoing. Having less than the full utterance available for planning the acoustic realisation has downsides, however, as global optimisation is no longer possible. In this paper we present a strategy for incrementalizing the symbolic pre-processing component of speech synthesis and assess the influence of a reduction in "lookahead", i.e. in knowledge about the rest of the utterance, on prosodic quality. We found that high-quality incremental output can be achieved even with a lookahead of slightly less than one phrase, allowing for timely system reaction.


#3 Expressing speaker's intentions through sentence-final intonations for Japanese conversational speech synthesis [PDF] [Copy] [Kimi1] [REL]

Authors: Kazuhiko Iwata, Tetsunori Kobayashi

In this study, we investigated the speaker's intentions that listeners perceive through subtly different sentence-final intonations. Approximately 2,000 sentence utterances were recorded, and the fundamental frequency (F0) contours at the last vowel of those sentences were classified with a standard clustering algorithm. Various F0 contours were found, namely not only simple rising and falling intonations but also rise-fall and fall-rise intonations. In order to reveal the relationship between intonation and intention, 10 representative contours were selected on the basis of the clustering results. Using the selected contours, a subjective evaluation was conducted. Six Japanese sentences that could take on different meanings according to their sentence-final intonations were synthesized, and the F0 contour at the last vowel of each sentence was replaced with each of the selected contours. The results of the evaluation by nine listeners showed that, for example, a certain falling intonation could express the intention of 'conviction' while another that differs only slightly in shape could convey 'doubt'. It was found that subtle differences in the sentence-final F0 shape convey various nuances and connotations.
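The abstract does not name the clustering algorithm used. As a rough illustration of the general procedure (length-normalizing the last-vowel F0 contours and clustering them into ten representatives), here is a minimal sketch using k-means on synthetic data; all names, shapes, and parameter values are hypothetical, not the authors' setup.

```python
import numpy as np
from sklearn.cluster import KMeans

def resample_contour(f0, n_points=20):
    """Length-normalize one sentence-final F0 contour by linear interpolation."""
    f0 = np.asarray(f0, dtype=float)
    return np.interp(np.linspace(0, 1, n_points), np.linspace(0, 1, len(f0)), f0)

# Toy stand-ins for the recorded last-vowel F0 contours (Hz), of varying length.
rng = np.random.default_rng(0)
contours = []
for _ in range(200):
    n = int(rng.integers(15, 40))
    t = np.linspace(0, 1, n)
    shape = rng.choice([-1.0, 1.0]) * 40 * t + rng.choice([0.0, 30.0]) * np.sin(np.pi * t)
    contours.append(120 + shape + rng.normal(0, 2, n))

# Convert to semitones relative to each contour's mean to remove level offsets.
X = np.stack([12 * np.log2(resample_contour(c) / np.mean(c)) for c in contours])

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
representatives = kmeans.cluster_centers_     # ten candidate "representative contours"
print(representatives.shape)                  # (10, 20)
```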


#4 Modeling pause-duration for style-specific speech synthesis [PDF] [Copy] [Kimi1] [REL]

Authors: Alok Parlikar, Alan W. Black

A major contribution to speaking style comes from both the location of phrase breaks in an utterance and the duration of these breaks. This paper is about modeling the duration of style-specific breaks. We look at six styles of speech. We present analysis showing that these styles differ in the duration of pauses in natural speech. We have built CART models to predict pause duration in these corpora and have integrated them into the Festival speech synthesis system. Our objective results show that, given sufficient training data, we can build style-specific models. Our subjective tests show that people can perceive the difference between the models and that they prefer style-specific models over simple pause-duration models.
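The paper builds CART pause-duration models inside Festival; the sketch below approximates the idea with scikit-learn's decision-tree regressor on made-up break features, purely to illustrate the modeling setup. The features, style labels, and toy durations are assumptions, not the authors' feature set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n = 2000
# Hypothetical per-break features: words since the previous break, words until the
# next break, punctuation class (0=none, 1=comma, 2=sentence end), and a style id.
X = np.column_stack([
    rng.integers(1, 15, n),
    rng.integers(1, 15, n),
    rng.integers(0, 3, n),
    rng.integers(0, 6, n),      # six speaking styles
])
# Toy pause durations (seconds): longer at sentence ends, with a style-dependent offset.
y = 0.08 + 0.12 * X[:, 2] + 0.02 * X[:, 3] + rng.normal(0, 0.03, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
cart = DecisionTreeRegressor(max_depth=6, min_samples_leaf=20).fit(X_tr, y_tr)
rmse = np.sqrt(np.mean((cart.predict(X_te) - y_te) ** 2))
print(f"held-out RMSE: {rmse:.3f} s")
```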


#5 Enumerating differences between various communicative functions for purposes of Czech expressive speech synthesis in limited domain [PDF] [Copy] [Kimi1] [REL]

Author: Martin Gruber

This paper deals with the determination of a penalty matrix intended to represent differences between various communicative functions. These functions are supposed to describe the expressivity that can occur in expressive speech and were designed to fit a limited domain of conversations between seniors and a computer on a given topic. The penalty matrix is expected to increase the rate at which expressivity is perceived in synthetic speech produced by the unit selection method. It should reflect both acoustic differences and differences based on human perception of expressivity.


#6 Quality analysis of macroprosodic F0 dynamics in text-to-speech signals [PDF] [Copy] [Kimi1] [REL]

Authors: Christoph R. Norrenbrock, Florian Hinterleitner, Ulrich Heute, Sebastian Möller

We present a study on the relation between fundamental frequency (F0) and its perceptual effect in the context of text-to-speech (TTS) synthesis. Features that essentially capture the intonational (macro-prosodic) properties of spoken speech are introduced and analysed with regard to the following questions: (i) How does the prosodic variation of TTS signals differ from natural speech? (ii) Is there a functional relationship between the prosodic variation of TTS signals and its perceived quality? In answering these questions we present novel approaches for the construction of non-intrusive quality estimators. The results reveal a substantial degree of systematic influence of prosodic variation on TTS quality.


#7 Improved automatic extraction of generation process model commands and its use for generating fundamental frequency contours for training HMM-based speech synthesis [PDF] [Copy] [Kimi1] [REL]

Authors: Hiroya Hashimoto, Keikichi Hirose, Nobuaki Minematsu

The generation process model of fundamental frequency (F0) contours can represent the F0 movements of speech well while maintaining a clear relation to the underlying linguistic information of the utterance. By using the model, an improvement in HMM-based speech synthesis can therefore be expected. One of the major problems preventing the use of the model is that the performance of automatic extraction of the model parameters from observed F0 contours is still rather limited. We developed a new method of automatic extraction. Its algorithm is inspired by how humans perform the task: it extracts phrase components first, whereas conventional methods extract accent components first. The method also uses linguistic information from the text, the same information used in HMM-based speech synthesis. A significant improvement in extraction is achieved. Using the method, the model parameters are extracted for the speech corpus used for HMM training, and F0 contours generated by the model are used for HMM training instead of the original F0 contours. Listening experiments on synthetic speech indicate improvements in speech quality.
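The generation process model referred to here is Fujisaki's command-response model, in which log F0 is the superposition of a baseline value, phrase components, and accent components. The sketch below generates a contour from given commands using the standard formulation; it illustrates the model itself, not the authors' extraction algorithm, and all command values are made up.

```python
import numpy as np

def phrase_response(t, alpha=2.0):
    """Phrase control impulse response: Gp(t) = alpha^2 * t * exp(-alpha*t) for t >= 0."""
    return np.where(t >= 0, alpha**2 * t * np.exp(-alpha * t), 0.0)

def accent_response(t, beta=20.0, gamma=0.9):
    """Accent control step response: Ga(t) = min(1 - (1 + beta*t) * exp(-beta*t), gamma)."""
    g = np.where(t >= 0, 1.0 - (1.0 + beta * t) * np.exp(-beta * t), 0.0)
    return np.minimum(g, gamma)

def generate_log_f0(t, fb, phrase_cmds, accent_cmds, alpha=2.0, beta=20.0):
    """ln F0(t) = ln Fb + sum_i Ap_i*Gp(t-T0_i) + sum_j Aa_j*(Ga(t-T1_j) - Ga(t-T2_j))."""
    ln_f0 = np.full_like(t, np.log(fb))
    for T0, Ap in phrase_cmds:
        ln_f0 += Ap * phrase_response(t - T0, alpha)
    for T1, T2, Aa in accent_cmds:
        ln_f0 += Aa * (accent_response(t - T1, beta) - accent_response(t - T2, beta))
    return ln_f0

t = np.arange(0.0, 3.0, 0.005)                       # 5 ms frames
ln_f0 = generate_log_f0(t, fb=90.0,
                        phrase_cmds=[(0.0, 0.5), (1.6, 0.3)],
                        accent_cmds=[(0.3, 0.7, 0.4), (1.9, 2.4, 0.3)])
f0 = np.exp(ln_f0)
```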


#8 Discontinuous observation HMM for prosodic-event-based F0 generation [PDF] [Copy] [Kimi1] [REL]

Authors: Tomoki Koriyama, Takashi Nose, Takao Kobayashi

This paper examines F0 modeling and generation techniques for spontaneous speech synthesis. In a previous study, we proposed a prosodic-unit HMM in which the synthesis unit is defined as a segment between two prosodic events represented in a ToBI label framework. To take advantage of the prosodic-unit HMM, continuous F0 sequences must be modeled from discontinuous F0 data that include unvoiced regions. Conventional F0 models such as the MSD-HMM and the continuous F0 HMM are not always appropriate for this purpose. To overcome this problem, we propose an alternative F0 model, the discontinuous observation HMM (DO-HMM), in which unvoiced frames are regarded as missing data. We objectively evaluate the performance of the DO-HMM by comparing it with conventional F0 modeling techniques and discuss the results.
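The key idea, treating unvoiced frames as missing rather than as a separate discrete symbol or an interpolated value, can be pictured with a toy state likelihood that simply skips unvoiced frames. This is only a sketch of the missing-data notion on invented numbers, not the DO-HMM training procedure.

```python
import numpy as np
from scipy.stats import norm

def state_log_likelihood(f0, voiced, mean, std):
    """Log-likelihood of a 1-D Gaussian F0 state, ignoring frames flagged as unvoiced.

    Unvoiced frames contribute nothing (their observation is treated as missing),
    instead of being forced to a dummy value or handled by a separate discrete space.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = np.asarray(voiced, dtype=bool)
    return norm.logpdf(f0[voiced], loc=mean, scale=std).sum()

# Toy segment: F0 in Hz with an unvoiced stretch in the middle.
f0 = np.array([118.0, 121.0, 119.0, 0.0, 0.0, 0.0, 124.0, 126.0])
voiced = f0 > 0
print(state_log_likelihood(f0, voiced, mean=120.0, std=10.0))
```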


#9 Hierarchical English emphatic speech synthesis based on HMM with limited training data [PDF] [Copy] [Kimi1] [REL]

Authors: Fanbo Meng, Zhiyong Wu, Helen Meng, Jia Jia, Lianhong Cai

Emphasis is an important form of expressiveness in speech. Hidden Markov model (HMM) based synthesis has shown great flexibility in generating expressive speech. This paper proposes a hierarchical HMM-based model aimed at synthesizing emphatic speech with both high emphasis quality and high naturalness from limited data. The decision tree (DT) is constructed with non-emphasis questions using both neutral and emphasis corpora. We classify the data in each leaf of the DT into six emphasis categories according to the emphasis-related questions. The data of the same emphasis category are grouped into one sub-node and used to train one HMM. As there may be no data for some emphasis categories in a given leaf of the DT, a method based on cost calculation is proposed to select a suitable HMM trained from the data of another sub-node in the same leaf for predicting parameters. Furthermore, a compensation model is proposed to adjust the predicted parameters. Experiments show that the proposed hierarchical model can synthesize emphatic speech of high quality in terms of both naturalness and emphasis, using a limited amount of training data.


#10 Employing sentence structure: syntax trees as prosody generators [PDF] [Copy] [Kimi1] [REL]

Authors: Sarah Hoffmann, Beat Pfister

In this paper, we describe a prosody generation system for speech synthesis that makes direct use of syntax trees to obtain duration and pitch. Instead of transforming the tree through special rules or extracting isolated features from the tree, we make use of the tree structure itself to construct a superpositional model that is able to learn the relation between syntax and prosody. We implemented the system in our SVOX text-to-speech system and evaluated it against the existing rule-based system. Informal listening tests showed that structural information from the tree is carried over to the prosody.


#11 A stochastic model of singing voice F0 contours for characterizing expressive dynamic components [PDF] [Copy] [Kimi1] [REL]

Authors: Yasunori Ohishi, Hirokazu Kameoka, Daichi Mochihashi, Kunio Kashino

We present a novel stochastic model of singing voice fundamental frequency (F0) contours for characterizing expressive dynamic components, such as vibrato and portamento. Although dynamic components can be important features for any singing voice applications, modeling and extracting these components from a raw F0 contour have yet to be accomplished. Therefore, we describe a process for generating dynamic components explicitly and represent the process as a stochastic model. Then we develop an algorithm for estimating the model parameters based on statistical techniques. Experimental results show that our method successfully extracts the expressive components from raw F0 contours.


#12 Text-to-speech intelligibility across speech rates [PDF] [Copy] [Kimi1] [REL]

Authors: Ann K. Syrdal, H. Timothy Bunnell, Susan R. Hertz, Taniya Mishra, Murray Spiegel, Corine Bickley, Deborah Rekart, Matthew J. Makashay

A web-based listening test measured the intelligibility across speech rates of 8 TTS systems and a linearly time-compressed human speech reference voice. Four synthesis methods were compared: formant, diphone concatenation, unit selection concatenation, and HMM synthesis. For each TTS method, a female and a male American English voice from each of 2 independent synthesis engines were tested. Semantically unpredictable sentences were presented at 6 speech rates from 200 to 450 words per minute. In an open response format, listeners typed what they heard. Listener transcriptions were automatically scored at the word level, and a normalized edit distance per speech rate was calculated for each of 355 listeners. There were significant differences among the TTS systems. The two unit selection TTS systems were the most intelligible across speech rates; one was equivalent to human speech. Listeners' native language, TTS familiarity, and audio equipment were also significant factors.
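As an illustration of the word-level scoring described here, the sketch below computes a normalized edit distance between a reference sentence and a listener transcription, dividing the edit distance by the number of reference words. The exact normalization used in the study is not specified, so that choice is an assumption.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance at the word level (substitutions, insertions, deletions)."""
    n, m = len(ref_words), len(hyp_words)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[n][m]

def normalized_edit_distance(reference, transcription):
    """Edit distance divided by the number of reference words (0 = perfect)."""
    ref = reference.lower().split()
    hyp = transcription.lower().split()
    return word_edit_distance(ref, hyp) / max(len(ref), 1)

print(normalized_edit_distance("the strong way drank the day",
                               "the strong way drank a day"))   # ~0.167
```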


#13 Objective intelligibility assessment of text-to-speech system using template constrained generalized posterior probability [PDF] [Copy] [Kimi1] [REL]

Authors: Linfang Wang, Lijuan Wang, Yan Teng, Zhe Geng, Frank K. Soong

Speech intelligibility is one of the most important measures in evaluating a text-to-speech (TTS) synthesizer. For fast comparison, development, and deployment of TTS systems, an automatic objective intelligibility measure is desired, as human listening tests are labor intensive, inconsistent, and expensive. In this work, we propose an automatic objective intelligibility measure for synthesized speech using the template constrained generalized posterior probability (TCGPP). TCGPP is a posterior-probability-based confidence measure, which has the advantage of identifying errors in synthesized speech at a fine level of granularity. Moreover, the TCGPP scores over a test set can be summarized into an overall objective intelligibility metric to compare two synthesizers or rank multiple TTS systems. We conducted experiments using the synthesized test sentences from all participants in the EH1 English task of the Blizzard Challenge 2010. The results show that the proposed measure has high correlation (corr = 0.9) with subjective scores and rankings.
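Computing TCGPP itself requires a recognizer and template-constrained lattices, which are beyond a short sketch. The code below only illustrates the final step described in the abstract: summarizing hypothetical per-word confidence scores into a system-level metric and checking its rank correlation against subjective scores. The scores and system names are invented.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-word confidence scores in [0, 1] (e.g. TCGPP-like posteriors)
# for a few sentences synthesized by each system under test.
system_word_scores = {
    "sys_A": [[0.95, 0.91, 0.88], [0.97, 0.90, 0.93, 0.85]],
    "sys_B": [[0.80, 0.62, 0.75], [0.71, 0.68, 0.77, 0.66]],
    "sys_C": [[0.90, 0.84, 0.86], [0.88, 0.83, 0.91, 0.80]],
}

def system_intelligibility(word_scores):
    """Average sentence score, where a sentence score is the mean of its word scores."""
    return float(np.mean([np.mean(s) for s in word_scores]))

objective = {name: system_intelligibility(s) for name, s in system_word_scores.items()}

# Hypothetical subjective intelligibility results (e.g. listener word accuracy).
subjective = {"sys_A": 0.92, "sys_B": 0.70, "sys_C": 0.85}

names = sorted(objective)
rho, _ = spearmanr([objective[n] for n in names], [subjective[n] for n in names])
print(objective, "rank correlation:", rho)
```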


#14 Mel cepstral coefficient modification based on the glimpse proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise [PDF] [Copy] [Kimi1] [REL]

Authors: Cassia Valentini-Botinhao, Junichi Yamagishi, Simon King

We propose a method that modifies the Mel cepstral coefficients of HMM-generated synthetic speech in order to increase the intelligibility of the generated speech when heard by a listener in the presence of a known noise. This method is based on an approximation we previously proposed for the Glimpse Proportion measure. Here we show how to update the Mel cepstral coefficients using this measure as an optimization criterion and how to control the amount of distortion by limiting the frequency resolution of the modifications. To evaluate the method we built eight different voices from normal read-text speech data from a male speaker. Some voices were also built from Lombard speech data produced by the same speaker. Listening experiments with speech-shaped noise and with a single competing talker indicate that our method significantly improves intelligibility when compared to unmodified synthetic speech. The voices built from Lombard speech outperformed the proposed method particularly for the competing talker case. However, compared to a voice using only the spectral parameters from Lombard speech, the proposed method obtains similar or higher performance.


#15 Speech-in-noise intelligibility improvement based on spectral shaping and dynamic range compression [PDF] [Copy] [Kimi1] [REL]

Authors: Tudor-Catalin Zorila, Varvara Kandia, Yannis Stylianou

In this paper, we suggest a non-parametric way to improve the intelligibility of speech in noise. The signal is enhanced before being presented in a noisy environment, under the constraint of equal global signal power before and after modification. Two systems are combined in cascade to enhance the signal first in frequency (spectral shaping) and then in time (dynamic range compression). Experiments with speech-shaped noise (SSN) and competing-speaker (CS) noise at various low SNR values show that the suggested approach outperforms state-of-the-art methods in terms of the Speech Intelligibility Index (SII). In terms of SNR gain, there is an improvement of 4 dB (SSN) and 8 dB (CS). A large formal listening test confirms the efficiency of the suggested system in enhancing speech intelligibility in noise.
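As a rough sketch of the time-domain stage only (dynamic range compression under an equal-global-power constraint), here is a minimal envelope-follower compressor followed by power renormalization. The spectral-shaping stage is omitted, and all parameter values and the toy input signal are assumptions rather than the authors' settings.

```python
import numpy as np

def dynamic_range_compression(x, fs, attack_ms=5.0, release_ms=50.0, exponent=0.5):
    """Envelope-based compression: low-energy regions are boosted relative to peaks."""
    env = np.zeros_like(x)
    a_att = np.exp(-1.0 / (fs * attack_ms * 1e-3))
    a_rel = np.exp(-1.0 / (fs * release_ms * 1e-3))
    level = 0.0
    for i, v in enumerate(np.abs(x)):           # one-pole envelope follower
        a = a_att if v > level else a_rel
        level = a * level + (1.0 - a) * v
        env[i] = level
    # exponent < 1 compresses the range; the floor avoids huge gains in silence.
    gain = np.maximum(env, 1e-3) ** (exponent - 1.0)
    return x * gain

def equal_power(modified, original):
    """Rescale so the global signal power matches the unmodified signal."""
    return modified * np.sqrt(np.sum(original**2) / (np.sum(modified**2) + 1e-12))

fs = 16000
t = np.arange(0, 1.0, 1.0 / fs)
speech = 0.1 * np.sin(2 * np.pi * 150 * t) * (0.2 + 0.8 * (np.sin(2 * np.pi * 3 * t) > 0))
out = equal_power(dynamic_range_compression(speech, fs), speech)
```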


#16 Implementation of simple spectral techniques to enhance the intelligibility of speech using a harmonic model [PDF] [Copy] [Kimi1] [REL]

Authors: Daniel Erro, Yannis Stylianou, Eva Navas, Inma Hernáez

We have designed a system that increases the intelligibility of speech signals in noise by manipulating the parameters of a harmonic speech model. The system performs the transformation in two steps: in the first step, it modifies the spectral slope, which is closely related to the vocal effort; in the second step, it amplifies low-energy parts of the signal using dynamic range compression techniques. Objective and subjective measures involving speech-shaped noise confirm the effectiveness of these simple methods. As the harmonic model has been used in previous works to implement the waveform generation module of high-quality statistical synthesizers, the system presented here can provide the synthesis engine with a higher degree of control on the intelligibility of the resulting artificial speech.
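A minimal sketch of the first step described above, modifying the spectral slope of one harmonic-model frame: harmonic amplitudes above a reference frequency are boosted by a fixed number of dB per octave, reducing the natural roll-off associated with low vocal effort. The parameter names and values are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def modify_spectral_slope(harmonic_amps, f0, tilt_db_per_octave=6.0, ref_hz=500.0):
    """Boost harmonics above ref_hz by tilt_db_per_octave, flattening the spectral slope."""
    amps = np.asarray(harmonic_amps, dtype=float)
    freqs = f0 * np.arange(1, len(amps) + 1)                  # harmonic frequencies in Hz
    octaves_above_ref = np.maximum(np.log2(freqs / ref_hz), 0.0)
    return amps * 10.0 ** (tilt_db_per_octave * octaves_above_ref / 20.0)

# One harmonic-model frame: F0 = 120 Hz, 40 harmonics with a toy -12 dB/octave roll-off.
f0 = 120.0
amps = 1.0 / np.arange(1, 41) ** 2
boosted = modify_spectral_slope(amps, f0)
```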


#17 Making conversational vowels more clear [PDF] [Copy] [Kimi] [REL]

Authors: Seyed Hamidreza Mohammadi, Alexander Kain, Jan P. H. van Santen

Previously, it has been shown that using clear speech short-term spectra improves the intelligibility of conversational speech. In this paper, a speech transformation method is used to map the spectral features of conversational speech to resemble clear speech. A joint-density Gaussian mixture model is used as the mapping function. The transformation is studied in both the formant frequency and the line spectral frequency domains. Listening test results show that in noisier environments, the transformed speech signal improves vowel intelligibility significantly compared to the original conversational speech. There is also an increase in vowel intelligibility in less noisy environments, but the increase is not statistically significant. The significance tests are performed by Tukey comparison and planned one-tail t-test methods.
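The joint-density GMM mapping mentioned here can be sketched with scikit-learn: fit a full-covariance GMM on stacked source/target vectors and convert a source frame with the standard conditional-mean formula. The data below are synthetic, the dimensionality is arbitrary, and this is an illustration of the mapping function, not the authors' feature extraction or training procedure.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

D = 4                                       # dimensionality of each feature vector (e.g. LSFs)
rng = np.random.default_rng(0)
src = rng.normal(size=(1000, D))            # toy "conversational" frames
tgt = src @ rng.normal(scale=0.5, size=(D, D)) + 0.1 * rng.normal(size=(1000, D))  # toy "clear" frames

gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(np.hstack([src, tgt]))              # joint vectors z = [x; y]

def convert(x):
    """y_hat = sum_m p(m|x) * ( mu_y_m + Syx_m Sxx_m^{-1} (x - mu_x_m) )."""
    log_post = np.empty(gmm.n_components)
    cond_means = np.empty((gmm.n_components, D))
    for m in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[m, :D], gmm.means_[m, D:]
        Sxx = gmm.covariances_[m, :D, :D]
        Syx = gmm.covariances_[m, D:, :D]
        log_post[m] = np.log(gmm.weights_[m]) + multivariate_normal.logpdf(x, mu_x, Sxx)
        cond_means[m] = mu_y + Syx @ np.linalg.solve(Sxx, x - mu_x)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return post @ cond_means

print(convert(src[0]), tgt[0])
```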


#18 Exploring rich expressive information from audiobook data using cluster adaptive training [PDF] [Copy] [Kimi1] [REL]

Authors: Langzhou Chen, Mark J. F. Gales, Vincent Wan, Javier Latorre, Masami Akamine

Audiobook data is a freely available source of rich expressive speech data. To accurately generate speech of this form, expressiveness must be incorporated into the synthesis system. This paper investigates two parts of this process: the representation of expressive information in a statistical parametric speech synthesis system; and whether discrete expressive state labels can sufficiently represent the full diversity of expressive speech. Initially a discrete form of expressive information was used. A new form of expressive representation, where each condition maps to a point in an expressive speech space, is described. This cluster adaptively trained (CAT) system is compared to incorporating information in the decision tree construction and a transform-based system using CMLLR and CSMAPLR. Experimental results indicate that the CAT system outperformed the contrast systems in both expressiveness and voice quality. The CAT-style representation yields a continuous expressive speech space. Thus, it is possible to treat utterance-level expressiveness as a point in this continuous space, rather than as one of a set of discrete states. This continuous-space representation outperformed discrete clusters, indicating limitations of discrete labels for expressiveness in audiobook data.


#19 Turning a monolingual speaker into multilingual for a mixed-language TTS [PDF] [Copy] [Kimi1] [REL]

Authors: Ji He, Yao Qian, Frank K. Soong, Sheng Zhao

We propose an approach to render speech in different languages from a speaker's monolingual recordings for building mixed-code TTS systems. The differences between two monolingual speakers' corpora, e.g. English and Chinese, are first equalized by warping spectral frequency, removing F0 variation, and adjusting speaking rate across speakers and languages. The English speaker's Chinese speech is then rendered by a trajectory tiling approach: the Chinese speaker's parameter trajectories, equalized towards the English speaker, are used to guide the search for the best sequence of 5 ms waveform "tiles" in the English speaker's recordings. The rendered Chinese speech of the English speaker, together with her own English recordings, is finally used to train a mixed-language (English-Chinese) HMM-based TTS system. Experimental results show that the proposed approach can synthesize high-quality mixed-language speech, as confirmed by both objective and subjective evaluations.
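The trajectory tiling search itself is a unit-selection-style optimization over candidate waveform tiles. As a stand-in, the sketch below performs a greedy frame-by-frame tile selection combining a target cost (distance to the guide trajectory) with a simple continuity cost; it illustrates the idea only, not the authors' search, and all dimensions and weights are assumptions.

```python
import numpy as np

def greedy_tile_selection(guide, tile_feats, w_concat=0.3):
    """Pick one tile index per guide frame, trading target cost against continuity.

    guide:      (T, D) guide parameter trajectory (equalized source speaker)
    tile_feats: (N, D) features of the candidate 5 ms tiles from the target speaker
    """
    chosen = []
    prev = None
    for g in guide:
        target_cost = np.linalg.norm(tile_feats - g, axis=1)
        if prev is None:
            cost = target_cost
        else:
            concat_cost = np.linalg.norm(tile_feats - tile_feats[prev], axis=1)
            cost = target_cost + w_concat * concat_cost
        prev = int(np.argmin(cost))
        chosen.append(prev)
    return chosen

rng = np.random.default_rng(0)
guide = rng.normal(size=(50, 13))        # e.g. 13-dim MFCC-like frames
tiles = rng.normal(size=(500, 13))
print(greedy_tile_selection(guide, tiles)[:10])
```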


#20 Using HMM-based speech synthesis to reconstruct the voice of individuals with degenerative speech disorders [PDF] [Copy] [Kimi] [REL]

Authors: Christophe Veaux, Junichi Yamagishi, Simon King

When individuals lose the ability to produce their own speech due to degenerative diseases such as motor neuron disease (MND) or Parkinson's, they lose not only a functional means of communication but also a display of their individual and group identity. In order to build personalized synthetic voices, attempts have been made to capture the voice before it is lost, using a process known as voice banking. But for some patients, speech deterioration coincides with or quickly follows diagnosis. Using HMM-based speech synthesis, it is now possible to build personalized synthetic voices from minimal recordings and even disordered speech. In this approach, the patient's recordings are used to adapt an average voice model pre-trained on many speakers. The structure of the voice model allows some reconstruction of the voice by substituting components from the average voice in order to compensate for the disorders found in the patient's speech. In this paper, we compare different substitution strategies and introduce a context-dependent model substitution to improve the intelligibility of the synthetic speech while retaining the vocal identity of the patient. A subjective evaluation of the reconstructed voice for a patient with MND shows promising results for this strategy.


#21 Speech factorization for HMM-TTS based on cluster adaptive training [PDF] [Copy] [Kimi] [REL]

Authors: Javier Latorre, Vincent Wan, Mark J. F. Gales, Langzhou Chen, K. K. Chin, Kate Knill, Masami Akamine

This paper presents a novel approach to factorize and control different speech factors in HMM-based TTS systems. In this paper cluster adaptive training (CAT) is used to factorize speaker identity and expressiveness (i.e. emotion). Within a CAT framework, each speech factor can be modelled by a different set of clusters. Users can control speaker identity and expressiveness independently by modifying the weights associated with each set. These weights are defined in a continuous space, so variations of speaker and emotion are also continuous. Additionally, given a speaker who has only neutral-style training data, the approach is able to synthesise speech in that speaker's voice with different expressions. Lastly, the paper discusses how generalization of the basic factorization concept could allow the production of expressive speech from neutral voices for other HMM-TTS systems not based on CAT.
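The factorization can be pictured as follows: each HMM state mean is a weighted combination of cluster means, with one weight vector per factor (speaker, expressiveness), so each factor can be varied continuously and independently. The sketch below uses random toy cluster means and arbitrary dimensions purely to show how separate weight sets combine; it is not the authors' model structure or estimation procedure.

```python
import numpy as np

D = 40                                          # dimensionality of one state's mean vector
rng = np.random.default_rng(0)

# Toy cluster mean "basis" vectors for a single HMM state.
speaker_clusters = rng.normal(size=(3, D))      # e.g. 3 speaker clusters
emotion_clusters = rng.normal(size=(4, D))      # e.g. 4 expressiveness clusters
bias_cluster = rng.normal(size=D)               # shared bias cluster

def cat_mean(speaker_weights, emotion_weights):
    """State mean as a weighted combination of cluster means, one weight set per factor."""
    return (bias_cluster
            + np.asarray(speaker_weights) @ speaker_clusters
            + np.asarray(emotion_weights) @ emotion_clusters)

# Same speaker, interpolating between two expressive styles:
neutral = cat_mean([1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
halfway = cat_mean([1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0])
```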


#22 Factored MLLR adaptation algorithm for HMM-based expressive TTS [PDF] [Copy] [Kimi1] [REL]

Authors: June Sig Sung, Doo Hwa Hong, Hyun Woo Koo, Nam Soo Kim

One of the most popular approaches to parameter adaptation in hidden Markov model (HMM) based systems is the maximum likelihood linear regression (MLLR) technique. In our previous work, we proposed factored MLLR (FMLLR), where an MLLR parameter is defined as a function of a control parameter vector, and presented a method to train the FMLLR parameters within the general framework of the expectation-maximization (EM) algorithm. To show its effectiveness, we applied FMLLR to adapt the spectral envelope features of reading-style speech to those of the singing voice. In this paper, we apply FMLLR to the HMM-based expressive speech synthesis task and compare its performance with conventional approaches. In a series of experiments, FMLLR shows better performance than conventional methods.
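The core FMLLR idea stated here, an MLLR transform defined as a function of a control vector, can be sketched as a linear combination of basis transforms, W(v) = sum_k v_k * W_k, applied to extended mean vectors. The basis transforms, dimensions, and control settings below are random toys, and the EM training of the bases is not shown.

```python
import numpy as np

D = 5                                    # feature dimensionality (toy)
K = 3                                    # dimensionality of the control parameter vector
rng = np.random.default_rng(0)

# Basis MLLR transforms, each of shape (D, D+1), acting on the extended mean [1; mu].
basis = rng.normal(scale=0.1, size=(K, D, D + 1))
for k in range(K):
    basis[k, :, 1:] += np.eye(D) / K     # start near the identity transform

def factored_mllr_transform(control):
    """W(v) = sum_k v_k * W_k : the MLLR transform as a function of the control vector v."""
    return np.tensordot(np.asarray(control), basis, axes=1)

def adapt_mean(mu, control):
    W = factored_mllr_transform(control)
    return W @ np.concatenate([[1.0], mu])

mu = rng.normal(size=D)
print(adapt_mean(mu, control=[1.0, 0.0, 0.0]))   # one extreme control setting
print(adapt_mean(mu, control=[0.2, 0.5, 0.3]))   # an intermediate control setting
```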


#23 Speaker-adaptive visual speech synthesis in the HMM-framework [PDF] [Copy] [Kimi1] [REL]

Authors: Dietmar Schabus, Michael Pucher, Gregor Hofer

In this paper we apply speaker-adaptive and speaker-dependent training of hidden Markov models (HMMs) to visual speech synthesis. In speaker-dependent training we use data from one speaker to train a visual and acoustic HMM. In speaker-adaptive training, first a visual background model (average voice) from multiple speakers is trained. This background model is then adapted to a new target speaker using (a small amount of) data from the target speaker. This concept has been successfully applied to acoustic speech synthesis. This paper demonstrates how model adaptation is applied to the visual domain to synthesize animations of talking faces. A perceptual evaluation is performed, showing that speaker-adaptive modeling outperforms speaker-dependent models for small amounts of training/adaptation data.


#24 Cross-lingual speaker adaptation for HMM-based speech synthesis based on perceptual characteristics and speaker interpolation [PDF] [Copy] [Kimi1] [REL]

Authors: Viviane de Franca Oliveira, Sayaka Shiota, Yoshihiko Nankaku, Keiichi Tokuda

The language mapping performed in a cross-lingual speaker adaptation task may not produce sufficient results if a bilingual database is not available. In order to overcome this problem, this work proposes a new method in which a correspondence between speakers in two different databases, speaking different languages, is established based on the perceptual characteristics of their voices. The proposed approach uses a language-independent space of voice characteristics obtained by performing subjective listening tests. This new space is used in the speaker adaptation process, making it possible to represent the input speaker in a different language while keeping his/her voice characteristics, without a bilingual database. Furthermore, the method is potentially able to adapt the prosodic information from the target speaker, such as long-term changes in F0 and durations. From the evaluation listening tests, we confirmed that the proposed framework generates speech that sounds similar to the target speaker voice, with better speech quality than the previously proposed method.


#25 C2H: a computational model of H&H-based phonetic contrast in synthetic speech [PDF] [Copy] [Kimi1] [REL]

Authors: Mauro Nicolao, Javier Latorre, Roger K. Moore

This paper presents a computational model of human speech production based on the hypothesis that low energy attractors for a human speech production system can be identified, and that interpolation/extrapolation along the key dimension of hypo/hyper-articulation can be motivated by energetic considerations of phonetic contrast. An HMM-based speech synthesiser with continuous adaptation of its statistical models was used to implement the model. Two adaptation methods were proposed for vowel and consonant models, and their effectiveness was tested by showing that such hypo/hyper-articulation control can successfully manipulate the intelligibility of synthetic speech in noise. Objective evaluations with the ANSI Speech Intelligibility Index indicate that intelligibility in various types of noise is effectively controlled. In particular, for the hyper-articulation transforms, the improvement over unadapted speech is above 25%.