AI Summary
This work addresses the challenge of separating speech from background noise under real-world conditions using a single microphone by proposing an unsupervised audio-visual separation and enhancement method based on generative inverse sampling. It introduces diffusion-based inverse sampling into unsupervised speech-noise disentanglement for the first time, modeling distinct diffusion priors for clean speech and environmental noise while leveraging visual cues to jointly recover all latent sound sources. The proposed framework accommodates multi-talker and off-screen speaker scenarios, achieving significantly lower word error rates (WER) than existing supervised baselines on noisy mixtures of one to three speakers. Moreover, it reconstructs the background noise with high fidelity, supporting downstream acoustic scene analysis.
Abstract
This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling, where we model clean speech and ambient noise with dedicated diffusion priors and jointly leverage them to recover all underlying sources. To achieve this, we reformulate a recent inverse sampler to match our setting. We evaluate on mixtures of 1, 2, and 3 speakers with noise and show that, despite being entirely unsupervised, our method consistently outperforms leading supervised baselines in word error rate (WER) across all conditions. We further extend our framework to handle off-screen speaker separation. Moreover, the high fidelity of the separated noise component makes it suitable for downstream acoustic scene detection. Demo page: https://ssnapsicml.github.io/ssnapsicml2026/
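For intuition, the sketch below shows the general flavor of diffusion-based inverse sampling for a two-source mixture: two independently trained diffusion priors (one for speech, one for noise) are sampled jointly while a mixture-consistency correction keeps their sum close to the observed signal. This is a minimal, hypothetical illustration rather than the paper's reformulated sampler: the noise schedule, the placeholder score functions `score_speech` and `score_noise`, the `guidance` weight, and the helper `separate` are all assumed for demonstration, and the actual method additionally conditions on visual cues.

```python
# Minimal, hypothetical sketch of diffusion-based inverse sampling for
# separating a mixture y = s + n with two diffusion priors (speech, noise).
# NOT the paper's sampler: the schedule, the score placeholders, and the
# mixture-consistency step are illustrative assumptions only.
import torch

T = 200                                  # number of reverse-diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)    # simple linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def score_speech(x, t):
    # Placeholder standing in for a pretrained clean-speech diffusion prior.
    return -x / (1.0 - alpha_bars[t])

def score_noise(x, t):
    # Placeholder standing in for a pretrained environmental-noise prior.
    return -x / (1.0 - alpha_bars[t])

def separate(y, guidance=1.0):
    """Jointly sample speech and noise estimates whose sum explains the mixture y."""
    s, n = torch.randn_like(y), torch.randn_like(y)
    for t in reversed(range(T)):
        for x, score in ((s, score_speech), (n, score_noise)):
            # DDPM-style reverse step for each source under its own prior.
            z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
            x += betas[t] * score(x, t)          # in-place: drift along the score
            x *= 1.0 / torch.sqrt(alphas[t])
            x += torch.sqrt(betas[t]) * z
        # Crude mixture-consistency correction: split the residual y - (s + n)
        # between the two estimates (a simplified stand-in for posterior guidance).
        residual = y - (s + n)
        s = s + 0.5 * guidance * residual
        n = n + 0.5 * guidance * residual
    return s, n

if __name__ == "__main__":
    mixture = torch.randn(1, 16000)          # stand-in for 1 s of 16 kHz audio
    speech_est, noise_est = separate(mixture)
    print(speech_est.shape, noise_est.shape)
```

In the actual system, the score placeholders would be replaced by the learned audio-visual speech prior and the environmental-noise prior, and the crude consistency step by the paper's reformulated inverse sampler.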