Diffusion-Based Unsupervised Audio-Visual Speech Separation in Noisy Environments with Noise Prior

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Speech separation from single-microphone recordings under environmental noise remains challenging, particularly in unsupervised settings lacking clean/noisy speech pairs. Method: This paper proposes an unsupervised audio-visual diffusion model that leverages lip-motion cues to construct a generative prior, jointly modeling the distributions of clean speech and structured noise. Crucially, it directly parameterizes the noise distribution, rather than treating noise as a residual, and co-optimizes it with the speech distribution via audio-visual score matching; separation is then performed by sampling from the posterior with a reverse diffusion process. Training requires only unpaired audio-visual data, with no supervision from aligned clean/noisy speech. Contribution/Results: The framework achieves significant improvements over existing unsupervised methods in complex, realistic noise conditions, demonstrating that explicit noise modeling combined with an audio-visual diffusion prior generalizes well without paired training data.

📝 Abstract
In this paper, we address the problem of single-microphone speech separation in the presence of ambient noise. We propose a generative unsupervised technique that directly models both clean speech and structured noise components, training exclusively on these individual signals rather than noisy mixtures. Our approach leverages an audio-visual score model that incorporates visual cues to serve as a strong generative speech prior. By explicitly modelling the noise distribution alongside the speech distribution, we enable effective decomposition through the inverse problem paradigm. We perform speech separation by sampling from the posterior distributions via a reverse diffusion process, which directly estimates and removes the modelled noise component to recover clean constituent signals. Experimental results demonstrate promising performance, highlighting the effectiveness of our direct noise modelling approach in challenging acoustic environments.
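The posterior-sampling idea can be illustrated on a toy problem in which both priors are Gaussian, so their scores are available in closed form. Below is a minimal NumPy sketch; the deterministic gradient ascent on the log-posterior is a MAP-style simplification of the paper's stochastic reverse-diffusion sampler, and the 1-D Gaussian priors merely stand in for the learned audio-visual score networks (all names and values are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the learned score models: speech ~ N(0, sig_s2) and
# structured noise ~ N(0, sig_n2), so both scores are known in closed form.
sig_s2, sig_n2 = 4.0, 0.25
score_speech = lambda s: -s / sig_s2    # d/ds log p(s)
score_noise = lambda n: -n / sig_n2     # d/dn log p(n)

# Observed single-microphone mixture y = s* + n*.
s_true = rng.normal(0.0, np.sqrt(sig_s2), size=4096)
n_true = rng.normal(0.0, np.sqrt(sig_n2), size=4096)
y = s_true + n_true

def separate(y, n_steps=2000, lam=50.0, lr=0.01):
    """Ascend log p(s) + log p(n) - lam/2 * ||y - s - n||^2: a deterministic
    simplification of posterior sampling via a reverse diffusion process."""
    s = np.zeros_like(y)
    n = np.zeros_like(y)
    for _ in range(n_steps):
        resid = y - s - n                        # mixture-consistency residual
        s += lr * (score_speech(s) + lam * resid)
        n += lr * (score_noise(n) + lam * resid)
    return s, n

s_hat, n_hat = separate(y)
```

Because both toy priors are Gaussian, the fixed point is a Wiener-like rescaling of the mixture, which makes the sketch easy to sanity-check; the paper instead uses neural score models conditioned on lip motion, which is what lets the speech prior follow the target speaker.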
Problem

Research questions and friction points this paper is trying to address.

Separating speech from ambient noise using a single microphone
Modeling clean speech and structured noise components without supervision
Leveraging audio-visual cues as a generative speech prior
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-based unsupervised audio-visual separation
Explicit noise distribution modelling with priors
Reverse diffusion process for noise removal
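The score matching used to train such priors reduces, at each noise level, to a denoising score-matching objective: the score network should map a perturbed sample back toward the data. A toy NumPy sketch with a Gaussian data distribution, where the optimal score is known in closed form (the names and scalar setup here are illustrative assumptions; the paper trains neural speech and noise scores, with the speech score conditioned on visual cues):

```python
import numpy as np

rng = np.random.default_rng(1)

def dsm_loss(score_fn, x, sigma):
    """Denoising score matching at noise level sigma: a good score model
    should map the perturbed sample x + sigma*eps to roughly -eps/sigma."""
    eps = rng.normal(size=x.shape)
    x_noisy = x + sigma * eps
    return np.mean((score_fn(x_noisy) + eps / sigma) ** 2)

# Gaussian data N(0, s2): the DSM-optimal score at level sigma is -x/(s2 + sigma^2).
s2, sigma = 4.0, 0.5
x = rng.normal(0.0, np.sqrt(s2), size=50_000)
optimal = lambda z: -z / (s2 + sigma**2)
misscaled = lambda z: -z                   # badly scaled score, for comparison
loss_opt = dsm_loss(optimal, x, sigma)
loss_bad = dsm_loss(misscaled, x, sigma)
```

In the unsupervised setting described above, an objective of this form can be minimized separately on unpaired clean-speech and noise recordings, which is what removes the need for aligned clean/noisy mixtures at training time.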