SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model

Authors: Kaidi Wang, Yi He, Wenhao Guan, Weijie Wu, Hongwu Ding, Xiong Zhang, Di Wu, Meng Meng, Jian Luan, Lin Li, Qingyang Hong

Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still fall short in speech naturalness and audio-visual synchronization, and are limited to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audio-visual consistency. We propose a Dual Speaker Encoder to effectively mitigate inter-language interference in cross-lingual speech synthesis and explore the application of video dubbing in video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential in video dubbing tasks.

Subjects: Audio and Speech Processing, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Multimedia, Sound

Publish: 2025-11-23 16:51:05 UTC

2512.05126
