DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis

Authors: Ye-Xin Lu, Yu Gu, Kun Wei, Hui-Peng Du, Yang Ai, Zhen-Hua Ling

This paper presents DAIEN-TTS, a zero-shot text-to-speech (TTS) framework that enables ENvironment-aware synthesis through Disentangled Audio Infilling. By leveraging separate speaker and environment prompts, DAIEN-TTS allows independent control over the timbre and the background environment of the synthesized speech. Built upon F5-TTS, the proposed DAIEN-TTS first incorporates a pretrained speech-environment separation (SES) module to disentangle the environmental speech into mel-spectrograms of clean speech and environment audio. Two random span masks of varying lengths are then applied to both mel-spectrograms, which, together with the text embedding, serve as conditions for infilling the masked environmental mel-spectrogram, enabling the simultaneous continuation of personalized speech and time-varying environmental audio. To further enhance controllability during inference, we adopt dual classifier-free guidance (DCFG) for the speech and environment components and introduce a signal-to-noise ratio (SNR) adaptation strategy to align the synthesized speech with the environment prompt. Experimental results demonstrate that DAIEN-TTS generates environmental personalized speech with high naturalness, strong speaker similarity, and high environmental fidelity.
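
The abstract does not say how the SNR adaptation strategy is implemented. As a rough illustration of the underlying idea only, the sketch below rescales a speech signal so that its SNR against an environment signal matches a given target. It assumes waveform-domain samples and a target SNR that is already known; the function names (`power`, `snr_db`, `scale_to_snr`) and the toy signals are hypothetical and are not taken from the paper or from F5-TTS.

```python
import math


def power(signal):
    """Mean power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def snr_db(speech, environment):
    """SNR in dB of the speech relative to the environment audio."""
    return 10.0 * math.log10(power(speech) / power(environment))


def scale_to_snr(speech, environment, target_snr_db):
    """Return a scaled copy of `speech` whose SNR against `environment`
    equals `target_snr_db`. (Assumed approach: a single global gain.)"""
    current = snr_db(speech, environment)
    # An amplitude gain g changes the SNR by 20 * log10(g) dB.
    gain = 10.0 ** ((target_snr_db - current) / 20.0)
    return [gain * x for x in speech]


# Toy signals standing in for separated speech and environment audio.
speech = [math.sin(0.1 * n) for n in range(1000)]
environment = [0.5 * math.sin(0.37 * n + 1.0) for n in range(1000)]

adapted = scale_to_snr(speech, environment, target_snr_db=5.0)
mixture = [s + e for s, e in zip(adapted, environment)]
```

A single global gain is the simplest choice; a system working on mel-spectrograms, as DAIEN-TTS does, may apply the adjustment differently.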

Subjects: Audio and Speech Processing, Sound

Publish: 2025-09-18 07:23:53 UTC

2509.14684
