Diffusion-Based Unsupervised Audio-Visual Speech Separation in Noisy Environments with Noise Prior

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Speech separation from single-microphone recordings under environmental noise remains challenging, particularly in unsupervised settings lacking clean/noisy speech pairs. Method: This paper proposes an unsupervised audio-visual diffusion model that leverages lip-motion cues to construct a generative prior, jointly modeling the distributions of clean speech and structured noise. Crucially, it directly parameterizes the noise distribution, rather than treating noise as a residual, and co-optimizes it with the speech distribution via audio-visual score matching; separation is then performed by sampling from the posterior with a reverse diffusion process. Training requires only unpaired audio-visual data, with no supervision from aligned clean/noisy speech. Contribution/Results: The framework achieves significant improvements over existing unsupervised methods in complex, realistic noise conditions, demonstrating that explicit noise modeling combined with an audio-visual diffusion prior generalizes well without paired training data.

📝 Abstract
In this paper, we address the problem of single-microphone speech separation in the presence of ambient noise. We propose a generative unsupervised technique that directly models both clean speech and structured noise components, training exclusively on these individual signals rather than noisy mixtures. Our approach leverages an audio-visual score model that incorporates visual cues to serve as a strong generative speech prior. By explicitly modelling the noise distribution alongside the speech distribution, we enable effective decomposition through the inverse problem paradigm. We perform speech separation by sampling from the posterior distributions via a reverse diffusion process, which directly estimates and removes the modelled noise component to recover clean constituent signals. Experimental results demonstrate promising performance, highlighting the effectiveness of our direct noise modelling approach in challenging acoustic environments.
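The posterior-sampling idea can be illustrated on a toy problem in which both priors are Gaussian, so their scores are available in closed form. Below is a minimal NumPy sketch; the deterministic gradient ascent on the log-posterior is a MAP-style simplification of the paper's stochastic reverse-diffusion sampler, and the 1-D Gaussian priors merely stand in for the learned audio-visual score networks (all names and values are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the learned score models: speech ~ N(0, sig_s2) and
# structured noise ~ N(0, sig_n2), so both scores are known in closed form.
sig_s2, sig_n2 = 4.0, 0.25
score_speech = lambda s: -s / sig_s2    # d/ds log p(s)
score_noise = lambda n: -n / sig_n2     # d/dn log p(n)

# Observed single-microphone mixture y = s* + n*.
s_true = rng.normal(0.0, np.sqrt(sig_s2), size=4096)
n_true = rng.normal(0.0, np.sqrt(sig_n2), size=4096)
y = s_true + n_true

def separate(y, n_steps=2000, lam=50.0, lr=0.01):
    """Ascend log p(s) + log p(n) - lam/2 * ||y - s - n||^2: a deterministic
    simplification of posterior sampling via a reverse diffusion process."""
    s = np.zeros_like(y)
    n = np.zeros_like(y)
    for _ in range(n_steps):
        resid = y - s - n                        # mixture-consistency residual
        s += lr * (score_speech(s) + lam * resid)
        n += lr * (score_noise(n) + lam * resid)
    return s, n

s_hat, n_hat = separate(y)
```

Because both toy priors are Gaussian, the fixed point is a Wiener-like rescaling of the mixture, which makes the sketch easy to sanity-check; the paper instead uses neural score models conditioned on lip motion, which is what lets the speech prior follow the target speaker.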
Problem

Research questions and friction points this paper is trying to address.

Separating speech from ambient noise using a single microphone
Modeling clean speech and structured noise components without supervision
Leveraging audio-visual cues as a generative speech prior
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-based unsupervised audio-visual separation
Explicit noise distribution modelling with priors
Reverse diffusion process for noise removal
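The score matching used to train such priors reduces, at each noise level, to a denoising score-matching objective: the score network should map a perturbed sample back toward the data. A toy NumPy sketch with a Gaussian data distribution, where the optimal score is known in closed form (the names and scalar setup here are illustrative assumptions; the paper trains neural speech and noise scores, with the speech score conditioned on visual cues):

```python
import numpy as np

rng = np.random.default_rng(1)

def dsm_loss(score_fn, x, sigma):
    """Denoising score matching at noise level sigma: a good score model
    should map the perturbed sample x + sigma*eps to roughly -eps/sigma."""
    eps = rng.normal(size=x.shape)
    x_noisy = x + sigma * eps
    return np.mean((score_fn(x_noisy) + eps / sigma) ** 2)

# Gaussian data N(0, s2): the DSM-optimal score at level sigma is -x/(s2 + sigma^2).
s2, sigma = 4.0, 0.5
x = rng.normal(0.0, np.sqrt(s2), size=50_000)
optimal = lambda z: -z / (s2 + sigma**2)
misscaled = lambda z: -z                   # badly scaled score, for comparison
loss_opt = dsm_loss(optimal, x, sigma)
loss_bad = dsm_loss(misscaled, x, sigma)
```

In the unsupervised setting described above, an objective of this form can be minimized separately on unpaired clean-speech and noise recordings, which is what removes the need for aligned clean/noisy mixtures at training time.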