🤖 AI Summary
Environment-aware text-to-speech (TTS) must disentangle speaker timbre from the attributes of the background environment. To address this, the paper proposes what it describes as the first zero-shot, disentangled audio infilling framework. Building on the F5-TTS architecture, the method adds a pretrained speech-environment separation module, a random span masking strategy, and a dual classifier-free guidance mechanism combined with signal-to-noise ratio (SNR) adaptation. Together these allow independent control over linguistic content, speaker identity, and acoustic environment. Experiments show significant improvements over baselines in naturalness, speaker similarity, and environmental fidelity. The approach synthesizes speech and environment jointly at high quality, offering a new paradigm for personalized, context-adaptive TTS systems.
📝 Abstract
This paper presents DAIEN-TTS, a zero-shot text-to-speech (TTS) framework that enables ENvironment-aware synthesis through Disentangled Audio Infilling. By leveraging separate speaker and environment prompts, DAIEN-TTS allows independent control over the timbre and the background environment of the synthesized speech. Built upon F5-TTS, DAIEN-TTS first incorporates a pretrained speech-environment separation (SES) module to disentangle environmental speech into mel-spectrograms of clean speech and environment audio. Two random span masks of varying lengths are then applied to the two mel-spectrograms, which, together with the text embedding, serve as conditions for infilling the masked environmental mel-spectrogram. This enables the simultaneous continuation of personalized speech and time-varying environmental audio. To further enhance controllability during inference, we adopt dual classifier-free guidance (DCFG) for the speech and environment components and introduce a signal-to-noise ratio (SNR) adaptation strategy to align the synthesized speech with the environment prompt. Experimental results demonstrate that DAIEN-TTS generates personalized environmental speech with high naturalness, strong speaker similarity, and high environmental fidelity.
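The abstract names three mechanisms without giving formulas: random span masking, dual classifier-free guidance, and SNR adaptation. The following is a minimal Python sketch of one plausible reading of each, not the authors' implementation. The function names, the span-length range, the way the two guidance weights are combined, and the gain-based SNR matching are all assumptions made for illustration.

```python
import math
import random


def random_span_mask(num_frames, min_frac=0.3, max_frac=0.7, rng=random):
    """Return a 0/1 list with one contiguous masked span (1 = masked).

    The span-length range is an assumed hyperparameter, not from the paper.
    """
    span = int(num_frames * rng.uniform(min_frac, max_frac))
    span = max(1, min(span, num_frames))
    start = rng.randint(0, num_frames - span)  # inclusive on both ends
    return [1 if start <= t < start + span else 0 for t in range(num_frames)]


def dual_cfg(uncond, speech_cond, full_cond, w_speech, w_env):
    """Combine three model outputs using two guidance weights.

    Assumed form: start from the unconditional output, move toward the
    speech-conditioned output, then toward the fully conditioned output.
    """
    return [
        u + w_speech * (s - u) + w_env * (f - s)
        for u, s, f in zip(uncond, speech_cond, full_cond)
    ]


def snr_db(speech, env):
    """Signal-to-noise ratio in dB from mean signal powers."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_env = sum(x * x for x in env) / len(env)
    return 10.0 * math.log10(p_speech / p_env)


def gain_for_target_snr(speech, env, target_db):
    """Amplitude gain for `speech` so that its SNR against `env` hits target_db."""
    return 10.0 ** ((target_db - snr_db(speech, env)) / 20.0)
```

In this sketch, the two masks for the speech and environment mel-spectrograms would come from two independent calls to `random_span_mask`, so their lengths generally differ. Setting both guidance weights to 1 in `dual_cfg` returns the fully conditioned output unchanged, and `gain_for_target_snr` could be used to match the SNR measured on a separated environment prompt.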