AI Summary
This work addresses the challenge of separating speech from background noise under real-world conditions using a single microphone by proposing an unsupervised audio-visual separation and enhancement method based on generative inverse sampling. It introduces diffusion-based inverse sampling into unsupervised speech-noise disentanglement for the first time, modeling distinct diffusion priors for clean speech and environmental noise while leveraging visual cues to jointly recover all latent sound sources. The proposed framework accommodates multi-talker and off-screen speaker scenarios, achieving significantly lower word error rates (WER) than existing supervised baselines on noisy mixtures of one to three speakers. Moreover, it reconstructs the background noise with high fidelity, supporting downstream acoustic scene analysis.
Abstract
This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling, where we model clean speech and ambient noise with dedicated diffusion priors and jointly leverage them to recover all underlying sources. To achieve this, we reformulate a recent inverse sampler to match our setting. We evaluate on mixtures of 1, 2, and 3 speakers with noise and show that, despite being entirely unsupervised, our method consistently outperforms leading supervised baselines in word error rate (WER) across all conditions. We further extend our framework to handle off-screen speaker separation. Moreover, the high fidelity of the separated noise component makes it suitable for downstream acoustic scene detection. Demo page: https://ssnapsicml.github.io/ssnapsicml2026/
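For intuition, the sketch below shows the general flavor of diffusion-based inverse sampling for a two-source mixture: two independently trained diffusion priors (one for speech, one for noise) are sampled jointly while a mixture-consistency correction keeps their sum close to the observed signal. This is a minimal, hypothetical illustration rather than the paper's reformulated sampler: the noise schedule, the placeholder score functions `score_speech` and `score_noise`, the `guidance` weight, and the helper `separate` are all assumed for demonstration, and the actual method additionally conditions on visual cues.

```python
# Minimal, hypothetical sketch of diffusion-based inverse sampling for
# separating a mixture y = s + n with two diffusion priors (speech, noise).
# NOT the paper's sampler: the schedule, the score placeholders, and the
# mixture-consistency step are illustrative assumptions only.
import torch

T = 200                                  # number of reverse-diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)    # simple linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def score_speech(x, t):
    # Placeholder standing in for a pretrained clean-speech diffusion prior.
    return -x / (1.0 - alpha_bars[t])

def score_noise(x, t):
    # Placeholder standing in for a pretrained environmental-noise prior.
    return -x / (1.0 - alpha_bars[t])

def separate(y, guidance=1.0):
    """Jointly sample speech and noise estimates whose sum explains the mixture y."""
    s, n = torch.randn_like(y), torch.randn_like(y)
    for t in reversed(range(T)):
        for x, score in ((s, score_speech), (n, score_noise)):
            # DDPM-style reverse step for each source under its own prior.
            z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
            x += betas[t] * score(x, t)          # in-place: drift along the score
            x *= 1.0 / torch.sqrt(alphas[t])
            x += torch.sqrt(betas[t]) * z
        # Crude mixture-consistency correction: split the residual y - (s + n)
        # between the two estimates (a simplified stand-in for posterior guidance).
        residual = y - (s + n)
        s = s + 0.5 * guidance * residual
        n = n + 0.5 * guidance * residual
    return s, n

if __name__ == "__main__":
    mixture = torch.randn(1, 16000)          # stand-in for 1 s of 16 kHz audio
    speech_est, noise_est = separate(mixture)
    print(speech_est.shape, noise_est.shape)
```

In the actual system, the score placeholders would be replaced by the learned audio-visual speech prior and the environmental-noise prior, and the crude consistency step by the paper's reformulated inverse sampler.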