UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching

Authors: Neta Glazer, Aviv Navon, Yael Segal, Aviv Shamsian, Hilit Segev, Asaf Buchnick, Menachem Pirchi, Gil Hetz, Joseph Keshet

Recent advances in Text-to-Speech (TTS) have enabled highly natural speech synthesis, yet integrating speech with complex background environments remains challenging. We introduce UmbraTTS, a flow-matching-based TTS model that jointly generates speech and environmental audio, conditioned on text and acoustic context. Our model allows fine-grained control over background volume and produces diverse, coherent, and context-aware audio scenes. A key challenge is the lack of data in which speech and background audio are aligned in a natural context. To overcome this, we propose a self-supervised framework that extracts speech, background audio, and transcripts from unannotated recordings. Extensive evaluations demonstrate that UmbraTTS significantly outperforms existing baselines, producing natural, high-quality, environmentally aware audio.

Subjects: Sound, Machine Learning, Audio and Speech Processing

Published: 2025-06-11 15:43:08 UTC

2506.09874
