UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-speech (TTS) models generate natural speech but struggle with text-driven, controllable co-synthesis of speech and complex environmental sounds due to the absence of real-world aligned speech–environment audio pairs. To address this, we propose the first end-to-end environment-aware TTS framework based on flow matching for joint generation. We introduce a novel self-supervised acoustic disentanglement method that decomposes unlabeled recordings into speech, text, and background components—overcoming the data scarcity bottleneck. Furthermore, we design a joint conditional modeling scheme with fine-grained environmental intensity control. Experiments demonstrate significant improvements over state-of-the-art models across objective and subjective metrics—including naturalness, environmental consistency, and scene diversity—achieving, for the first time, high-fidelity, controllable, text-driven joint synthesis of speech and environmental audio.

📝 Abstract
Recent advances in Text-to-Speech (TTS) have enabled highly natural speech synthesis, yet integrating speech with complex background environments remains challenging. We introduce UmbraTTS, a flow-matching based TTS model that jointly generates speech and environmental audio, conditioned on text and acoustic context. The model allows fine-grained control over background volume and produces diverse, coherent, context-aware audio scenes. A key challenge is the lack of recordings in which speech and background audio are aligned in their natural context; to overcome this shortage of paired training data, we propose a self-supervised framework that extracts speech, background audio, and transcripts from unannotated recordings. Extensive evaluations demonstrate that UmbraTTS significantly outperforms existing baselines, producing natural, high-quality, environment-aware audio.
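The flow-matching objective underlying this kind of model can be sketched generically. The snippet below is a minimal conditional flow matching training loss, not the paper's architecture: the toy model, array shapes, and variable names are illustrative assumptions. The idea is to sample a point on a straight path between noise and data, and regress the network's output onto that path's constant velocity.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_loss(model, x1, cond):
    """Conditional flow matching loss for one batch.

    x1:   target audio features, shape (batch, dim)
    cond: conditioning vector (e.g. text + environment embedding)
    """
    batch, dim = x1.shape
    x0 = rng.standard_normal((batch, dim))    # noise endpoint of the path
    t = rng.uniform(size=(batch, 1))          # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1              # point on the linear path
    v_target = x1 - x0                        # constant velocity of that path
    v_pred = model(xt, t, cond)               # network predicts the velocity
    return float(np.mean((v_pred - v_target) ** 2))

# Placeholder "model" that just echoes xt, to show the shapes involved;
# a real model would be a neural net conditioned on t and cond.
def toy_model(xt, t, cond):
    return xt

x1 = rng.standard_normal((4, 8))
cond = rng.standard_normal((4, 16))
loss = cfm_loss(toy_model, x1, cond)
```

At inference, one would integrate the learned velocity field from noise to data; joint speech-plus-environment generation amounts to conditioning that field on both the text and an environment description.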
Problem

Research questions and friction points this paper is trying to address.

Integrating speech with complex background environments in TTS
Lack of paired training data for speech and background audio
Generating context-aware, high-quality audio scenes with fine control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow-matching TTS for speech and environment
Self-supervised framework for unannotated data
Fine-grained control over background audio
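One simple way to picture a scalar "background volume" control is mixing a background track under speech at a target speech-to-background SNR. This is only an illustration of the control knob, an assumption on our part; UmbraTTS conditions generation on the level rather than mixing after the fact, and the function name and signals below are invented for the sketch.

```python
import numpy as np

def mix_at_snr(speech, background, snr_db):
    """Scale `background` so the speech-to-background power ratio is `snr_db`,
    then add it under `speech`. Lower snr_db means a louder background."""
    sp_power = np.mean(speech ** 2)
    bg_power = np.mean(background ** 2) + 1e-12   # avoid division by zero
    # Gain that makes sp_power / (gain^2 * bg_power) equal 10^(snr_db / 10).
    gain = np.sqrt(sp_power / (bg_power * 10 ** (snr_db / 10)))
    return speech + gain * background

rng = np.random.default_rng(1)
speech = rng.standard_normal(16000)      # 1 s of placeholder "speech" at 16 kHz
background = rng.standard_normal(16000)  # placeholder environmental noise
mixed = mix_at_snr(speech, background, snr_db=10.0)
```

Exposing the level as a continuous conditioning value, instead of post-hoc mixing like this, is what lets the model keep speech and background acoustically coherent (shared reverberation, consistent scene) while the intensity varies.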