Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis

Authors: Yifan Hu, Rui Liu, Yi Ren, Xiang Yin, Haizhou Li

Conversational Speech Synthesis (CSS) aims to align synthesized speech with the emotional and stylistic context of user-agent interactions to achieve empathy. Current generative CSS models face interpretability limitations due to insufficient emotional perception and redundant discrete speech coding. To address these issues, we present Chain-Talker, a three-stage framework mimicking human cognition: Emotion Understanding derives context-aware emotion descriptors from dialogue history; Semantic Understanding generates compact semantic codes via serialized prediction; and Empathetic Rendering synthesizes expressive speech by integrating both components. To support emotion modeling, we develop CSS-EmCap, an LLM-driven automated pipeline for generating precise conversational speech emotion captions. Experiments on three benchmark datasets demonstrate that Chain-Talker produces more expressive and empathetic speech than existing methods, with CSS-EmCap contributing to reliable emotion modeling. The code and demos are available at: https://github.com/AI-S2-Lab/Chain-Talker.

Subjects: Sound, Audio and Speech Processing

Publish: 2025-05-19 01:24:52 UTC

2505.12597
