DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis

📅 2025-09-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To disentangle speaker timbre from the background environment in environment-aware text-to-speech (TTS), this paper proposes DAIEN-TTS, a zero-shot framework based on disentangled audio infilling. Building on the F5-TTS architecture, it adds a pretrained speech-environment separation (SES) module, random span masking of the separated speech and environment mel-spectrograms, and, at inference, dual classifier-free guidance over the speech and environment components together with a signal-to-noise ratio (SNR) adaptation strategy. Separate speaker and environment prompts give independent control over timbre and acoustic environment, while the input text determines the linguistic content. The reported experiments show high naturalness, strong speaker similarity, and high environmental fidelity.

📝 Abstract

This paper presents DAIEN-TTS, a zero-shot text-to-speech (TTS) framework that enables ENvironment-aware synthesis through Disentangled Audio Infilling. By leveraging separate speaker and environment prompts, DAIEN-TTS allows independent control over the timbre and the background environment of the synthesized speech. Built upon F5-TTS, the proposed DAIEN-TTS first incorporates a pretrained speech-environment separation (SES) module to disentangle the environmental speech into mel-spectrograms of clean speech and environment audio. Two random span masks of varying lengths are then applied to both mel-spectrograms, which, together with the text embedding, serve as conditions for infilling the masked environmental mel-spectrogram, enabling the simultaneous continuation of personalized speech and time-varying environmental audio. To further enhance controllability during inference, we adopt dual class-free guidance (DCFG) for the speech and environment components and introduce a signal-to-noise ratio (SNR) adaptation strategy to align the synthesized speech with the environment prompt. Experimental results demonstrate that DAIEN-TTS generates environmental personalized speech with high naturalness, strong speaker similarity, and high environmental fidelity.
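The random span masking step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `random_span_mask` and `build_conditions`, the span-length range (30% to 70% of the frames), and the choice of zeroing masked frames are all assumptions, and the SES module is represented only by its two outputs (a clean-speech and an environment mel-spectrogram of shape frames x mel bins).

```python
import numpy as np


def random_span_mask(num_frames, min_frac=0.3, max_frac=0.7, rng=None):
    """Return a boolean mask with one contiguous True span (hypothetical helper).

    The span length is drawn uniformly between min_frac and max_frac of the
    sequence; these fractions are illustrative, not values from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    span_len = int(num_frames * rng.uniform(min_frac, max_frac))
    span_len = max(1, min(span_len, num_frames))
    start = int(rng.integers(0, num_frames - span_len + 1))
    mask = np.zeros(num_frames, dtype=bool)
    mask[start:start + span_len] = True
    return mask


def build_conditions(speech_mel, env_mel, rng=None):
    """Mask an independently sampled span in each mel-spectrogram.

    speech_mel and env_mel stand in for the two outputs of a separation
    module. Masked frames are set to zero here, which is an assumption.
    """
    rng = np.random.default_rng() if rng is None else rng
    speech_mask = random_span_mask(speech_mel.shape[0], rng=rng)
    env_mask = random_span_mask(env_mel.shape[0], rng=rng)
    speech_cond = np.where(speech_mask[:, None], 0.0, speech_mel)
    env_cond = np.where(env_mask[:, None], 0.0, env_mel)
    return speech_cond, env_cond, speech_mask, env_mask
```

Because the two masks are sampled independently, their lengths and positions generally differ, which matches the abstract's "two random span masks of varying lengths". The unmasked frames of each spectrogram, together with the text embedding, would then be passed to the infilling model as conditions.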

Problem

Research questions and friction points this paper is trying to address.

Enabling independent control of speaker timbre and background environment

Disentangling environmental speech into clean speech and environment audio

Generating environment-aware personalized speech with high fidelity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangled audio infilling for environment control

Dual classifier-free guidance (DCFG) for enhanced controllability

Signal-to-noise ratio (SNR) adaptation to align synthesized speech with the environment prompt
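The two inference-time controls listed above can be illustrated with a short sketch. The source does not give the guidance formula or the details of the SNR adaptation strategy, so everything below is an assumption: `dual_guidance` shows one common way to compose two guidance terms from three model predictions, and `scale_env_to_snr` shows the standard arithmetic for rescaling an environment signal to reach a target SNR against a speech signal. All function and parameter names are hypothetical.

```python
import numpy as np


def dual_guidance(v_uncond, v_speech, v_full, w_speech=2.0, w_env=2.0):
    """Combine three predictions with two guidance weights (assumed form).

    v_uncond: prediction with both prompts dropped
    v_speech: prediction conditioned on the speaker prompt only
    v_full:   prediction conditioned on speaker and environment prompts
    With w_speech = w_env = 1 this reduces to v_full.
    """
    return (v_uncond
            + w_speech * (v_speech - v_uncond)
            + w_env * (v_full - v_speech))


def snr_db(speech, env, eps=1e-12):
    """Signal-to-noise ratio in dB between a speech and an environment signal."""
    p_speech = np.mean(speech ** 2)
    p_env = np.mean(env ** 2)
    return 10.0 * np.log10((p_speech + eps) / (p_env + eps))


def scale_env_to_snr(speech, env, target_snr_db, eps=1e-12):
    """Rescale env so that snr_db(speech, scaled_env) is close to the target.

    Scaling an amplitude by g changes its power by 20*log10(g) dB, so the
    required gain is 10 ** ((current_snr - target_snr) / 20).
    """
    gain = 10.0 ** ((snr_db(speech, env, eps) - target_snr_db) / 20.0)
    return env * gain
```

In this sketch, raising `w_speech` pushes the output toward the speaker prompt and raising `w_env` pushes it toward the environment prompt, which is the kind of independent control the abstract describes. The SNR helpers only show how a target ratio between speech and environment could be measured and enforced; how DAIEN-TTS applies its adaptation during synthesis is not specified in the text above.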
