🤖 AI Summary
Current text-to-speech (TTS) models generate natural speech but struggle with text-driven, controllable co-synthesis of speech and complex environmental sounds, largely because real-world aligned speech–environment audio pairs are unavailable. To address this, we propose the first end-to-end environment-aware TTS framework based on flow matching for joint generation. We introduce a novel self-supervised acoustic disentanglement method that decomposes unlabeled recordings into speech, text, and background components, overcoming the data-scarcity bottleneck. Furthermore, we design a joint conditional modeling scheme with fine-grained control over environmental intensity. Experiments demonstrate significant improvements over state-of-the-art models on both objective and subjective metrics, including naturalness, environmental consistency, and scene diversity, achieving for the first time high-fidelity, controllable, text-driven joint synthesis of speech and environmental audio.
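The generative backbone is flow matching. As a concrete illustration, the sketch below shows a generic conditional flow-matching (OT-CFM) training step with text, environment, and intensity conditioning; the tensor shapes, the `model` signature, and the use of a mel-spectrogram target are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def cfm_loss(model, x1, text_emb, env_emb, env_gain, sigma_min=1e-4):
    """One conditional flow-matching training step (generic OT-CFM objective).

    Assumed shapes (illustrative, not from the paper):
      x1:       target mel spectrogram of mixed speech+background, (B, T, D)
      text_emb: encoded transcript, (B, L, D)
      env_emb:  encoded environment/background description, (B, D)
      env_gain: scalar background-intensity control in [0, 1], (B,)
    """
    b = x1.size(0)
    t = torch.rand(b, device=x1.device).view(b, 1, 1)        # t ~ U(0, 1)
    x0 = torch.randn_like(x1)                                # noise sample
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1             # OT interpolant
    target = x1 - (1 - sigma_min) * x0                       # target velocity field
    pred = model(xt, t.view(b), text_emb, env_emb, env_gain) # assumed signature
    return F.mse_loss(pred, target)
```

At inference time, one would integrate the learned velocity field from noise to data with an ODE solver, with `env_gain` providing the fine-grained background-intensity control.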
📝 Abstract
Recent advances in Text-to-Speech (TTS) have enabled highly natural speech synthesis, yet integrating speech with complex background environments remains challenging. We introduce UmbraTTS, a flow-matching based TTS model that jointly generates speech and environmental audio, conditioned on text and acoustic context. Our model allows fine-grained control over background volume and produces diverse, coherent, and context-aware audio scenes. A key challenge is the scarcity of recordings in which speech and background audio are aligned in a natural context. To overcome this lack of paired training data, we propose a self-supervised framework that extracts speech, background audio, and transcripts from unannotated recordings. Extensive evaluations demonstrate that UmbraTTS significantly outperforms existing baselines, producing natural, high-quality, environment-aware audio.
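To make the self-supervised pair-construction idea concrete, here is a minimal sketch that turns unannotated recordings into (speech, background, transcript) triples. The `separate_speech_background` helper is hypothetical, a stand-in for any speech/background source-separation model, and Whisper is used purely as an example ASR; the paper's actual pipeline may differ.

```python
import whisper  # example ASR (openai-whisper); any transcriber would do

def separate_speech_background(wav_path):
    """Hypothetical separator: returns paths to speech and background stems."""
    raise NotImplementedError("plug in any speech/background separation model")

def build_training_triples(wav_paths):
    """Decompose unannotated recordings into (speech, background, text) triples."""
    asr = whisper.load_model("base")
    triples = []
    for path in wav_paths:
        speech_wav, background_wav = separate_speech_background(path)
        transcript = asr.transcribe(speech_wav)["text"]
        triples.append({"speech": speech_wav,
                        "background": background_wav,
                        "text": transcript})
    return triples
```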