INTERSPEECH 2024 - Speech Synthesis

| Total: 151

#1 Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals

Authors: Kentaro Seki ; Shinnosuke Takamichi ; Norihiro Takamune ; Yuki Saito ; Kanami Imamura ; Hiroshi Saruwatari

This paper proposes a new task called spatial voice conversion, which aims to convert a target voice while preserving spatial information and non-target signals. Traditional voice conversion methods focus on single-channel waveforms, ignoring the stereo listening experience inherent in human hearing. Our baseline approach addresses this gap by integrating blind source separation (BSS), voice conversion (VC), and spatial mixing to handle multi-channel waveforms. Through experimental evaluations, we organize and identify the key challenges inherent in this task, such as maintaining audio quality and accurately preserving spatial information. Our results highlight the fundamental difficulties in balancing these aspects, providing a benchmark for future research in spatial voice conversion. The proposed method's code is publicly available to encourage further exploration in this domain.

#2 Neural Codec Language Models for Disentangled and Textless Voice Conversion

Authors: Alan Baade ; Puyuan Peng ; David Harwath

We introduce a method for textless any-to-any voice conversion based on recent progress in speech synthesis driven by neural codec language models. To disentangle speaker and linguistic information, we adapt a speaker normalizing procedure for discrete semantic units, and then generate with an autoregressive language model for greatly improved diversity. We further improve the similarity of the output audio to the target speaker's voice by leveraging classifier-free guidance. We evaluate our techniques against current text-to-speech synthesis and voice conversion systems and compare the effectiveness of different neural codec language model pipelines. We demonstrate state-of-the-art results in accent disentanglement and speaker similarity for voice conversion with significantly less compute than existing codec language models such as VALL-E.
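
A minimal sketch of classifier-free guidance applied to next-token prediction in an autoregressive token LM, in the spirit of the speaker-similarity technique mentioned above. The model, prompts, and guidance scale below are illustrative stand-ins, not the paper's implementation.

```python
import torch
import torch.nn as nn

def cfg_next_token_logits(lm, prefix_cond, prefix_uncond, guidance_scale=2.0):
    """Classifier-free guidance over next-token logits (illustrative sketch).

    `lm` is a hypothetical autoregressive token LM returning (batch, seq, vocab)
    logits; `prefix_cond` includes the target-speaker prompt, `prefix_uncond`
    replaces it with a null prompt. The guidance scale is an assumption.
    """
    cond = lm(prefix_cond)[:, -1, :]
    uncond = lm(prefix_uncond)[:, -1, :]
    # Push the next-token distribution toward the speaker-conditioned prediction.
    return uncond + guidance_scale * (cond - uncond)

# Toy usage with an embedding + linear stack standing in for a codec language model.
vocab = 1024
toy_lm = nn.Sequential(nn.Embedding(vocab, 64), nn.Linear(64, vocab))
cond_tokens = torch.randint(0, vocab, (1, 20))      # speaker prompt + units so far
uncond_tokens = torch.randint(0, vocab, (1, 20))    # null prompt + units so far
next_logits = cfg_next_token_logits(toy_lm, cond_tokens, uncond_tokens)
probs = torch.softmax(next_logits, dim=-1)
```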

#3 Fine-Grained and Interpretable Neural Speech Editing

Authors: Max Morrison ; Cameron Churchwell ; Nathan Pruyne ; Bryan Pardo

Fine-grained editing of speech attributes - such as prosody (i.e., the pitch, loudness, and phoneme durations), pronunciation, speaker identity, and formants - is useful for fine-tuning and fixing imperfections in human and AI-generated speech recordings for creation of podcasts, film dialogue, and video game dialogue. Existing speech synthesis systems use representations that entangle two or more of these attributes, prohibiting their use in fine-grained, disentangled editing. In this paper, we demonstrate the first disentangled and interpretable representation of speech with comparable subjective and objective vocoding reconstruction accuracy to Mel spectrograms. Our interpretable representation, combined with our proposed data augmentation method, enables training an existing neural vocoder to perform fast, accurate, and high-quality editing of pitch, duration, volume, timbral correlates of volume, pronunciation, speaker identity, and spectral balance.

#4 FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation

Authors: Takuhiro Kaneko ; Hirokazu Kameoka ; Kou Tanaka ; Yuto Kondo

Diffusion-based voice conversion (VC) techniques such as VoiceGrad have attracted interest because of their high VC performance in terms of speech quality and speaker similarity. However, a notable limitation is the slow inference caused by multi-step reverse diffusion. Therefore, we propose FastVoiceGrad, a novel one-step diffusion-based VC that reduces the number of iterations from dozens to one while inheriting the high VC performance of multi-step diffusion-based VC. We obtain the model using adversarial conditional diffusion distillation (ACDD), leveraging the abilities of generative adversarial networks and diffusion models while reconsidering the initial states in sampling. Evaluations of one-shot any-to-any VC demonstrate that FastVoiceGrad achieves VC performance superior or comparable to that of previous multi-step diffusion-based VC while improving inference speed.

#5 DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion

Authors: Ziqian Ning ; Shuai Wang ; Pengcheng Zhu ; Zhichao Wang ; Jixun Yao ; Lei Xie ; Mengxiao Bi

Streaming voice conversion has gained popularity for its applicability in real-time applications. The recently proposed DualVC 2 achieves robust and high-quality streaming voice conversion at a latency of approximately 180 ms. However, DualVC 2 is based on the recognition-synthesis framework, with multi-level cascaded models that cannot be jointly optimized, and it suffers severe performance drops on small chunks because of its ASR encoder. To address these issues, we propose an end-to-end model, DualVC 3. It incorporates K-means clustered SSL features to guide the training of the content encoder and adopts an optional language model for pseudo-content generation to improve conversion quality. Experimental results demonstrate that DualVC 3 achieves performance comparable to DualVC 2 in both subjective and objective metrics, with a latency of only 50 ms. We have made our audio samples publicly available.
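
Below is a minimal sketch of how frame-level SSL features can be discretized with K-means into content units of the kind used to guide a content encoder. Random features and the cluster count are stand-ins; this is not the paper's configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random features stand in for real SSL representations (e.g., HuBERT/WavLM frames).
rng = np.random.default_rng(0)
ssl_features = rng.normal(size=(2000, 64))            # (frames, feature_dim)

# Cluster the frames and map each frame to its nearest centroid ID.
kmeans = KMeans(n_clusters=100, n_init=4, random_state=0).fit(ssl_features)
content_units = kmeans.predict(ssl_features)           # (frames,) discrete content units

print(content_units[:10])
```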

#6 Towards Realistic Emotional Voice Conversion using Controllable Emotional Intensity

Authors: Tianhua Qi ; Shiyan Wang ; Cheng Lu ; Yan Zhao ; Yuan Zong ; Wenming Zheng

Realistic emotional voice conversion (EVC) aims to enhance the emotional diversity of converted audio, making synthesized voices more authentic and natural. To this end, we propose the Emotional Intensity-aware Network (EINet), which dynamically adjusts intonation and rhythm by incorporating controllable emotional intensity. To better capture nuances in emotional intensity, we go beyond mere distance measurements among acoustic features. Instead, an emotion evaluator is utilized to precisely quantify the speaker's emotional state. By employing an intensity mapper, intensity pseudo-labels are obtained to bridge the gap between emotional speech intensity modeling and run-time conversion. To ensure high speech quality while retaining controllability, an emotion renderer is used to smoothly combine linguistic features with manipulated emotional features at the frame level. Furthermore, we employ a duration predictor to facilitate adaptive prediction of rhythm changes conditioned on the specified intensity value. Experimental results show EINet's superior performance in naturalness and diversity of emotional expression compared to state-of-the-art EVC methods.

#7 Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model

Authors: Jinlong Xue ; Yayue Deng ; Yicheng Han ; Yingming Gao ; Ya Li

Recent advances in large language models (LLMs) and the development of audio codecs have greatly propelled zero-shot TTS. Such systems can synthesize personalized speech with only a 3-second recording of an unseen speaker as an acoustic prompt. However, they only support short speech prompts and cannot leverage longer context information, as required in audiobook and conversational TTS scenarios. In this paper, we introduce a novel audio codec-based TTS model that adapts context features with multiple enhancements. Inspired by the success of Qformer, we propose a multi-modal context-enhanced Qformer (MMCE-Qformer) to utilize additional multi-modal context information. In addition, we adapt a pretrained LLM, leveraging its understanding ability, to predict semantic tokens, and use SoundStorm to generate acoustic tokens, thereby enhancing audio quality and speaker similarity. Extensive objective and subjective evaluations show that our proposed method outperforms baselines across various context TTS scenarios.

#8 An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS

Authors: Xiaofei Wang ; Sefik Emre Eskimez ; Manthan Thakker ; Hemin Yang ; Zirun Zhu ; Min Tang ; Yufei Xia ; Jinzhu Li ; Sheng Zhao ; Jinyu Li ; Naoyuki Kanda

Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker’s voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio prompt contains noise, and limited research has been conducted to address this issue. In this paper, we explored various strategies to enhance the quality of audio generated from noisy audio prompts within the context of flow-matching-based zero-shot TTS. Our investigation includes comprehensive training strategies: unsupervised pre-training with masked speech denoising, multi-speaker detection and DNSMOS-based data filtering on the pre-training data, and fine-tuning with random noise mixing. The results of our experiments demonstrate significant improvements in intelligibility, speaker similarity, and overall audio quality compared to the approach of applying speech enhancement to the audio prompt.
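
The fine-tuning strategy above relies on mixing random noise into otherwise clean prompts. A minimal sketch of mixing at a randomly drawn SNR follows; the SNR range and signals are assumptions for illustration, not the paper's setup.

```python
import numpy as np

def mix_at_random_snr(speech, noise, snr_db_range=(0.0, 20.0), rng=None):
    """Mix noise into a speech prompt at a randomly drawn SNR (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    snr_db = rng.uniform(*snr_db_range)
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so speech_power / (scaled noise power) matches the target SNR.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Toy usage with random signals standing in for real audio waveforms.
rng = np.random.default_rng(0)
noisy_prompt = mix_at_random_snr(rng.normal(size=16000), rng.normal(size=16000), rng=rng)
```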

#9 Lightweight Zero-shot Text-to-Speech with Mixture of Adapters

Authors: Kenichi Fujita ; Takanori Ashihara ; Marc Delcroix ; Yusuke Ijima

Advancements in zero-shot text-to-speech (TTS) methods based on large-scale models have demonstrated high fidelity in reproducing speaker characteristics. However, these models are too large for practical daily use. We propose a lightweight zero-shot TTS method using a mixture of adapters (MoA). Our proposed method incorporates MoA modules into the decoder and the variance adapter of a non-autoregressive TTS model. These modules enhance the ability to adapt to a wide variety of speakers in a zero-shot manner by selecting appropriate adapters associated with speaker characteristics on the basis of speaker embeddings. Our method achieves high-quality speech synthesis with minimal additional parameters. Through objective and subjective evaluations, we confirmed that our method achieves better performance than the baseline with less than 40% of the parameters and 1.9 times faster inference.
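
A minimal sketch of a mixture-of-adapters layer whose routing is driven by a speaker embedding, illustrating the general idea of selecting adapters by speaker characteristics. Dimensions, routing, and the residual placement are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MixtureOfAdapters(nn.Module):
    """Illustrative MoA layer: a router mixes bottleneck adapters per speaker."""

    def __init__(self, hidden_dim=256, spk_dim=128, num_adapters=4, bottleneck=32):
        super().__init__()
        self.router = nn.Linear(spk_dim, num_adapters)
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, bottleneck), nn.ReLU(),
                          nn.Linear(bottleneck, hidden_dim))
            for _ in range(num_adapters)
        )

    def forward(self, hidden, spk_emb):
        # hidden: (batch, time, hidden_dim); spk_emb: (batch, spk_dim)
        weights = torch.softmax(self.router(spk_emb), dim=-1)           # (batch, A)
        outs = torch.stack([a(hidden) for a in self.adapters], dim=-1)  # (batch, time, dim, A)
        mixed = (outs * weights[:, None, None, :]).sum(dim=-1)
        return hidden + mixed                                            # residual connection

x, spk = torch.randn(2, 100, 256), torch.randn(2, 128)
print(MixtureOfAdapters()(x, spk).shape)   # torch.Size([2, 100, 256])
```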

#10 DINO-VITS: Data-Efficient Zero-Shot TTS with Self-Supervised Speaker Verification Loss for Noise Robustness

Authors: Vikentii Pankov ; Valeria Pronina ; Alexander Kuzmin ; Maksim Borisov ; Nikita Usoltsev ; Xingshan Zeng ; Alexander Golubkov ; Nikolai Ermolenko ; Aleksandra Shirshova ; Yulia Matveeva

We address zero-shot TTS systems' noise-robustness problem by proposing a dual-objective training for the speaker encoder using self-supervised DINO loss. This approach enhances the speaker encoder with the speech synthesis objective, capturing a wider range of speech characteristics beneficial for voice cloning. At the same time, the DINO objective improves speaker representation learning, ensuring robustness to noise and speaker discriminability. Experiments demonstrate significant improvements in subjective metrics under both clean and noisy conditions, outperforming traditional speaker-encoder-based TTS systems. Additionally, we explore training zero-shot TTS on noisy, unlabeled data. Our two-stage training strategy, leveraging self-supervised speech models to distinguish between noisy and clean speech, shows notable advances in similarity and naturalness, especially with noisy training datasets, compared to the ASR-transcription-based approach.

#11 Universal Score-based Speech Enhancement with High Content Preservation

Authors: Robin Scheibler ; Yusuke Fujita ; Yuma Shirahata ; Tatsuya Komatsu

We propose UNIVERSE++, a universal speech enhancement method based on score-based diffusion and adversarial training. Specifically, we improve the existing UNIVERSE model, which decouples clean speech feature extraction and diffusion. Our contributions are three-fold. First, we make several modifications to the network architecture, improving training stability and final performance. Second, we introduce an adversarial loss to promote the learning of high-quality speech features. Third, we propose a low-rank adaptation scheme with a phoneme fidelity loss to improve content preservation in the enhanced speech. In the experiments, we train a universal enhancement model on a large-scale dataset of speech degraded by noise, reverberation, and various distortions. Results on multiple public benchmark datasets demonstrate that UNIVERSE++ compares favorably to both discriminative and generative baselines across a wide range of quality and intelligibility metrics.
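
For readers unfamiliar with low-rank adaptation, here is a generic sketch of LoRA applied to a frozen linear layer; only the small A and B matrices are trained. The rank, scaling, and placement are illustrative assumptions and not tied to the paper's adaptation scheme.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal low-rank adaptation of a frozen linear layer (illustrative)."""

    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # freeze the pretrained weight
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the trainable low-rank update.
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(4, 512)).shape)   # torch.Size([4, 512])
```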

#12 Genhancer: High-Fidelity Speech Enhancement via Generative Modeling on Discrete Codec Tokens

Authors: Haici Yang ; Jiaqi Su ; Minje Kim ; Zeyu Jin

We present a high-fidelity generative speech enhancement model, Genhancer, which generates clean speech as discrete codec tokens while conditioning on the input speech features. Discrete codec tokens provide an efficient latent domain in place of the conventional time or time-frequency domain, enabling complex modeling of speech and allowing generative modeling to enforce speaker consistency and content continuity. We provide insights into the best-fit generation scheme for enhancement among parallel prediction, auto-regression, and masking, and demonstrate the benefits of conditioning on both pre-trained and jointly learned speech features. Subjective and objective tests show that Genhancer significantly improves audio quality and speaker-identity retention over SOTA baselines, both conventional and generative, while preserving content accuracy. Audio samples and supplementary materials are available at https://minjekim.com/research-projects/genhancer

#13 Schrödinger Bridge for Generative Speech Enhancement

Authors: Ante Jukić ; Roman Korostik ; Jagadeesh Balam ; Boris Ginsburg

This paper proposes a generative speech enhancement model based on the Schrödinger bridge (SB). The proposed model employs a tractable SB to formulate a data-to-data process between the clean speech distribution and the observed noisy speech distribution. The model is trained with a data prediction loss, aiming to recover the complex-valued clean speech coefficients, and an auxiliary time-domain loss is used to improve training. The effectiveness of the proposed SB-based model is evaluated on two speech enhancement tasks: speech denoising and speech dereverberation. The experimental results demonstrate that the proposed SB-based model outperforms diffusion-based models in terms of speech quality metrics and ASR performance, e.g., resulting in a relative word error rate reduction of 20% for denoising and 6% for dereverberation compared to the best baseline model. The proposed model also demonstrates improved efficiency, achieving better quality than the baselines for the same number of sampling steps and at a reduced computational cost.

#14 Thunder: Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian Bridge

Authors: Thanapat Trachu ; Chawan Piansaddhayanon ; Ekapol Chuangsuwanich

Diffusion-based speech enhancement has shown promising results but can suffer from slow inference. Initializing the diffusion process with the enhanced audio generated by a regression-based model can reduce the number of computational steps required. However, these approaches often necessitate a separate regression model, further increasing the system's complexity. We propose Thunder, a unified regression-diffusion model that utilizes the Brownian bridge process to allow the model to act in both modes. The regression mode is accessed by setting the diffusion time step close to 1. However, standard score-based diffusion modeling does not perform well in this setup due to gradient instability. To mitigate this problem, we modify the diffusion model to predict the clean speech instead of the score function, achieving competitive performance with a more compact model size and fewer reverse steps.
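
A hedged sketch of one common Brownian bridge formulation between clean speech (t = 0) and noisy speech (t = 1), with a tiny network trained to predict the clean signal rather than the score. The exact bridge parameterization, sigma, and features are assumptions, not the paper's equations.

```python
import torch
import torch.nn as nn

def brownian_bridge_sample(x0, y, t, sigma=0.5):
    """Sample x_t on a Brownian bridge between clean x0 (t=0) and noisy y (t=1).

    The mean interpolates linearly and the variance sigma^2 * t * (1 - t)
    vanishes at both endpoints (illustrative formulation).
    """
    t = t.view(-1, 1)
    mean = (1.0 - t) * x0 + t * y
    std = sigma * torch.sqrt(t * (1.0 - t))
    return mean + std * torch.randn_like(x0)

# Toy training step: the network predicts the clean features x0 from (x_t, t).
net = nn.Sequential(nn.Linear(80 + 1, 256), nn.ReLU(), nn.Linear(256, 80))
x0, y = torch.randn(8, 80), torch.randn(8, 80)   # stand-ins for clean / noisy features
t = torch.rand(8)
x_t = brownian_bridge_sample(x0, y, t)
pred_clean = net(torch.cat([x_t, t.unsqueeze(1)], dim=1))
loss = nn.functional.mse_loss(pred_clean, x0)
loss.backward()
```

At t close to 1, x_t is essentially the noisy input, so a single clean-speech prediction behaves like the regression mode described above.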

#15 Pre-training Feature Guided Diffusion Model for Speech Enhancement

Authors: Yiyuan Yang ; Niki Trigoni ; Andrew Markham

Speech enhancement significantly improves the clarity and intelligibility of speech in noisy environments, benefiting communication and listening experiences. In this paper, we introduce a novel pre-training feature-guided diffusion model tailored for efficient speech enhancement, addressing the limitations of existing discriminative and generative models. By integrating spectral features into a variational autoencoder (VAE), leveraging pre-trained features for guidance during the reverse process, and utilizing the deterministic discrete integration method (DDIM) to streamline sampling steps, our model improves efficiency and speech enhancement quality. Demonstrating state-of-the-art results on two public datasets with different SNRs, our model outperforms other baselines in efficiency and robustness. The proposed method not only optimizes performance but also enhances practical deployment capabilities, without increasing computational demands.
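
A minimal sketch of deterministic DDIM-style sampling of the kind used to reduce the number of reverse steps. The noise predictor, schedule, and step count are stand-ins for the paper's guided model and configuration.

```python
import torch

def ddim_sample(eps_model, x_T, alphas_cumprod, steps):
    """Deterministic DDIM-style sampling sketch (eta = 0).

    `eps_model(x_t, t)` predicts the noise; `alphas_cumprod` is a decreasing
    schedule of cumulative alpha products.
    """
    x = x_T
    timesteps = torch.linspace(len(alphas_cumprod) - 1, 0, steps).long()
    for i in range(len(timesteps) - 1):
        t, t_prev = timesteps[i], timesteps[i + 1]
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        eps = eps_model(x, t)
        x0_pred = (x - torch.sqrt(1 - a_t) * eps) / torch.sqrt(a_t)      # predicted clean sample
        x = torch.sqrt(a_prev) * x0_pred + torch.sqrt(1 - a_prev) * eps  # deterministic update
    return x

# Toy usage with a dummy noise predictor and a simple linear schedule.
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
dummy_eps = lambda x, t: torch.zeros_like(x)
out = ddim_sample(dummy_eps, torch.randn(1, 80), alphas_cumprod, steps=10)
```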

#16 Guided conditioning with predictive network on score-based diffusion model for speech enhancement

Authors: Dail Kim ; Da-Hee Yang ; Donghyun Kim ; Joon-Hyuk Chang ; Jeonghwan Choi ; Moa Lee ; Jaemo Yang ; Han-gil Moon

Although diffusion-based speech enhancement (SE) models have emerged, they are less effective at noise removal than predictive SE models. This reflects a trade-off between generative models, which can produce more natural speech based on the estimated target distribution, and predictive models, which are more effective at noise removal. To mitigate this trade-off, we propose a novel conditioning method for score-based diffusion models. The proposed approach guides the diffusion model with a pretrained predictive model without joint training, so that the enhanced speech offers a proper direction to the diffusion model. The effectiveness of the proposed method is highlighted by outperforming the baseline method with only half the number of sampling steps.

#17 SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models

Authors: Chun Yin ; Tai-Shih Chi ; Yu Tsao ; Hsin-Min Wang

Representations from pre-trained speech foundation models (SFMs) have shown impressive performance in many downstream tasks. However, the potential benefits of incorporating pre-trained SFM representations into speaker voice similarity assessment have not been thoroughly investigated. In this paper, we propose SVSNet+, a model that integrates pre-trained SFM representations to improve performance in assessing speaker voice similarity. Experimental results on the Voice Conversion Challenge 2018 and 2020 datasets show that SVSNet+ incorporating WavLM representations shows significant improvements compared to baseline models. In addition, while fine-tuning WavLM with a small dataset of the downstream task does not improve performance, using the same dataset to learn a weighted-sum representation of WavLM can substantially improve performance. Furthermore, when WavLM is replaced by other SFMs, SVSNet+ still outperforms the baseline models and exhibits strong generalization ability.
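
A minimal sketch of the learned weighted-sum idea: a softmax over per-layer scalar weights mixes the hidden states of a frozen speech foundation model into one representation. The layer count and dimensions are placeholders for WavLM's actual outputs.

```python
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """Learn a weighted sum over the hidden layers of a speech foundation model."""

    def __init__(self, num_layers):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: (num_layers, batch, time, dim) from a frozen SFM.
        weights = torch.softmax(self.layer_weights, dim=0)
        return (weights[:, None, None, None] * hidden_states).sum(dim=0)

layers = torch.randn(13, 2, 100, 768)          # stand-in for 13 WavLM layer outputs
print(WeightedLayerSum(13)(layers).shape)      # torch.Size([2, 100, 768])
```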

#18 Enhancing Out-of-Vocabulary Performance of Indian TTS Systems for Practical Applications through Low-Effort Data Strategies

Authors: Srija Anand ; Praveen Srinivasa Varadhan ; Ashwin Sankar ; Giri Raju ; Mitesh M. Khapra

Publicly available TTS datasets for low-resource languages like Hindi and Tamil typically contain 10-20 hours of data, leading to poor vocabulary coverage. This limitation becomes evident in downstream applications where domain-specific vocabulary, coupled with frequent code-mixing with English, results in many OOV words. To highlight this problem, we create a benchmark containing OOV words from several real-world applications. Indeed, state-of-the-art Hindi and Tamil TTS systems perform poorly on this OOV benchmark, as indicated by intelligibility tests. To improve the model's OOV performance, we propose a low-effort and economically viable strategy to obtain more training data. Specifically, we propose using volunteers, as opposed to high-quality voice artists, to record words containing character bigrams unseen in the training data. We show that with such inexpensive data, performance on OOV words improves without affecting voice quality or in-domain performance.
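
A small sketch of selecting candidate words that contain character bigrams never seen in the training text, i.e., the words one would ask volunteers to record. The toy strings are illustrative only.

```python
def unseen_bigram_words(training_text, candidate_words):
    """Return candidates containing at least one character bigram unseen in training data."""
    def bigrams(text):
        return {text[i:i + 2] for i in range(len(text) - 1)}

    seen = bigrams(training_text)
    return [w for w in candidate_words if bigrams(w) - seen]

# "hell" is fully covered by the training text, so only the other two are selected.
print(unseen_bigram_words("hello world", ["hell", "quartz", "low"]))  # ['quartz', 'low']
```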

#19 Assessing the impact of contextual framing on subjective TTS quality

Authors: Jens Edlund ; Christina Tånnander ; Sébastien Le Maguer ; Petra Wagner

Text-To-Speech (TTS) evaluations are habitually carried out without contextual and situational framing. Since humans adapt their speaking style to situation specific communicative needs, such evaluations may not generalize across situations. Without clearly defined framing, it is even unclear in which situations evaluation results hold at all. We test the hypothesized impact of framing on TTS evaluation in a crowdsourced MOS evaluation of four TTS voices, systematically varying (a) the intended TTS task (domestic humanoid robot, child’s voice replacement, fiction audio books and long and information-rich texts) and (b) the framing of that task. The results show that framing differentiated MOS responses, with individual TTS performance varying significantly across tasks and framings. This corroborates the assumption that decontextualized MOS evaluations do not generalize, and suggests that TTS evaluations should not be reported without the type of framing that was employed, if any.

#20 What do people hear? Listeners’ Perception of Conversational Speech

Authors: Adaeze Adigwe ; Sarenne Wallbridge ; Simon King

Conversational agents are becoming increasingly popular, prompting the need for text-to-speech (TTS) systems that sound conversational. Previous research has focused on training TTS models on elicited or found conversational speech and then measuring improved listener preference. Preference ratings cannot pinpoint why TTS voices fall short of conversational expectations, underscoring our limited understanding of conversational speaking styles. In this pilot study, we conduct interviews with naive listeners who judge whether speech was taken from a conversation and then explain their judgement. Our results indicate that listeners are capable of distinguishing conversational utterances from read speech on the basis of acoustic features alone. While listeners' explanations vary, they generally allude to pronunciation, rhythmic organisation, and inappropriate prosody. Using targeted prosodic modifications to synthesise speech, we shed light on the complexity of evaluating conversational style.

#21 Uncertainty-Aware Mean Opinion Score Prediction

Authors: Hui Wang ; Shiwan Zhao ; Jiaming Zhou ; Xiguang Zheng ; Haoqin Sun ; Xuechen Wang ; Yong Qin

Mean Opinion Score (MOS) prediction has made significant progress in specific domains. However, the unstable performance of MOS prediction models across diverse samples presents ongoing challenges for the practical application of these systems. In this paper, we point out that the absence of uncertainty modeling is a significant limitation hindering MOS prediction systems from being applied in the real, open world. We analyze the sources of uncertainty in the MOS prediction task and propose an uncertainty-aware MOS prediction system that models aleatoric uncertainty and epistemic uncertainty with heteroscedastic regression and Monte Carlo dropout, respectively. The experimental results show that the system captures uncertainty well and is capable of performing selective prediction and out-of-domain detection. Such capabilities significantly enhance the practical utility of MOS systems in diverse real and open-world environments.
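
A minimal sketch of the two uncertainty mechanisms named above: a head that predicts a mean and log-variance trained with a Gaussian NLL (heteroscedastic regression, aleatoric uncertainty) and Monte Carlo dropout at inference (epistemic uncertainty). Feature dimensions and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class UncertaintyMOSHead(nn.Module):
    """Illustrative MOS head predicting a per-sample mean and log-variance."""

    def __init__(self, in_dim=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Dropout(0.2))
        self.mean = nn.Linear(128, 1)
        self.log_var = nn.Linear(128, 1)

    def forward(self, x):
        h = self.body(x)
        return self.mean(h), self.log_var(h)

def gaussian_nll(mean, log_var, target):
    # Heteroscedastic regression loss: the variance of each sample is learned.
    return (0.5 * (torch.exp(-log_var) * (target - mean) ** 2 + log_var)).mean()

def mc_dropout_predict(model, x, passes=20):
    model.train()                       # keep dropout active at inference time
    means = torch.stack([model(x)[0] for _ in range(passes)])
    return means.mean(0), means.var(0)  # predictive mean and epistemic variance

head = UncertaintyMOSHead()
feats, mos = torch.randn(16, 256), torch.rand(16, 1) * 4 + 1
loss = gaussian_nll(*head(feats), mos)
loss.backward()
```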

#22 Lifelong Learning MOS Prediction for Synthetic Speech Quality Evaluation

Authors: Félix Saget ; Meysam Shamsi ; Marie Tahon

Mean Opinion Score (MOS) has long been the standard for the perceptual evaluation of speech synthesis quality; however, this criterion is costly and hardly reproducible. Automatic neural MOS predictors have emerged as a solution for the objective assessment of synthetic speech. These predictors are trained once on data collected from past listening tests and thus may struggle to adapt to new technological breakthroughs in speech synthesis. In this study, we investigate the applicability of lifelong learning to MOS predictors, where training samples are fed to the model in chronological order. A sequential lifelong mode and a cumulative lifelong mode are compared with traditional batch training using the BVCC and Blizzard Challenge datasets. The experiments show the advantages of lifelong learning in cross-corpus evaluation as well as in a constrained data-availability scenario.

#23 GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis

Authors: Zehua Kcriss Li ; Meiying Melissa Chen ; Yi Zhong ; Pinxin Liu ; Zhiyao Duan

Expressive speech synthesis aims to generate speech that captures a wide range of para-linguistic features, including emotion and articulation, though current research primarily emphasizes emotional aspects over the nuanced articulatory features mastered by professional voice actors. Inspired by this, we explore expressive speech synthesis through the lens of articulatory phonetics. Specifically, we define a framework with three dimensions: Glottalization, Tenseness, and Resonance (GTR), to guide the synthesis at the voice production level. With this framework, we record a high-quality speech dataset named GTR-Voice, featuring 20 Chinese sentences articulated by a professional voice actor across 125 distinct GTR combinations. We verify the framework and GTR annotations through automatic classification and listening tests, and demonstrate precise controllability along the GTR dimensions on two fine-tuned expressive TTS models. We open-source the dataset and TTS models.

#24 TSP-TTS: Text-based Style Predictor with Residual Vector Quantization for Expressive Text-to-Speech

Authors: Donghyun Seong ; Hoyoung Lee ; Joon-Hyuk Chang

Expressive text-to-speech (TTS) aims to synthesize more human-like speech by incorporating diverse speech styles or emotions. While most expressive TTS models rely on reference speech to condition the style of the generated speech, they often fail to produce speech of consistent quality. To ensure consistent speech quality, we propose an expressive TTS model conditioned on a style representation extracted from the text itself. To implement this text-based style predictor, we design a style module incorporating residual vector quantization. Furthermore, the style representation is enhanced through style-to-text alignment and a mel decoder with style hierarchical layer normalization (SHLN). Our experimental findings demonstrate that our proposed model accurately estimates style representation, enabling the generation of high-quality speech without the need for reference speech.
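
A minimal sketch of residual vector quantization in general: each codebook quantizes the residual left by the previous stage, and the selected codewords sum to an approximation of the input vector. The random codebooks and sizes below are stand-ins for the learned style module.

```python
import torch

def residual_vector_quantize(x, codebooks):
    """Illustrative RVQ: successive codebooks quantize the remaining residual."""
    residual, quantized, indices = x, torch.zeros_like(x), []
    for codebook in codebooks:                      # codebook: (codebook_size, dim)
        dists = torch.cdist(residual, codebook)     # (batch, codebook_size)
        idx = dists.argmin(dim=-1)
        code = codebook[idx]
        quantized = quantized + code
        residual = residual - code
        indices.append(idx)
    return quantized, torch.stack(indices, dim=-1)

style = torch.randn(4, 64)                          # stand-in style vectors
books = [torch.randn(256, 64) for _ in range(3)]    # 3 stages of random codebooks
q, ids = residual_vector_quantize(style, books)
print(q.shape, ids.shape)   # torch.Size([4, 64]) torch.Size([4, 3])
```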

#25 Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models

Authors: Weiqin Li ; Peiji Yang ; Yicheng Zhong ; Yixuan Zhou ; Zhisheng Wang ; Zhiyong Wu ; Xixin Wu ; Helen Meng

Spontaneous style speech synthesis, which aims to generate human-like speech, often encounters challenges due to the scarcity of high-quality data and limitations in model capabilities. Recent language model-based TTS systems can be trained on large, diverse, and low-quality speech datasets, resulting in highly natural synthesized speech. However, they are limited by the difficulty of simulating various spontaneous behaviors and capturing prosody variations in spontaneous speech. In this paper, we propose a novel spontaneous speech synthesis system based on language models. We systematically categorize and uniformly model diverse spontaneous behaviors. Moreover, fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech. Experimental results show that our proposed method significantly outperforms the baseline methods in terms of prosody naturalness and spontaneous behavior naturalness.