🤖 AI Summary
Real-world speech scenarios often involve concurrent noise corruption and speaker overlap, yet conventional speech enhancement (SE) and speech separation (SS) are typically modeled independently; existing joint approaches rely heavily on complex supervised training, limiting generalizability and scalability. This paper proposes UniVoiceLite, a lightweight, unsupervised audio-visual unified framework that, for the first time, incorporates Wasserstein distance regularization into audio-visual speaker separation to stabilize the latent space without paired data, enabling simultaneous noise suppression and speaker separation within a single model. The method integrates lip-motion dynamics, facial identity embeddings, and a streamlined network architecture, supporting end-to-end unsupervised training. Experiments demonstrate state-of-the-art performance in challenging multi-noise, multi-speaker settings, with a small parameter count, low inference latency, and strong cross-dataset generalization. The code is publicly available.
📝 Abstract
Speech Enhancement (SE) and Speech Separation (SS) have traditionally been treated as distinct tasks in speech processing. However, real-world audio often involves both background noise and overlapping speakers, motivating the need for a unified solution. While recent approaches have attempted to integrate SE and SS within multi-stage architectures, these methods typically involve complex, parameter-heavy models and rely on supervised training, limiting scalability and generalization. In this work, we propose UniVoiceLite, a lightweight and unsupervised audio-visual framework that unifies SE and SS within a single model. UniVoiceLite leverages lip motion and facial identity cues to guide speech extraction and employs Wasserstein distance regularization to stabilize the latent space without requiring paired noisy-clean data. Experimental results demonstrate that UniVoiceLite achieves strong performance in both noisy and multi-speaker scenarios, combining efficiency with robust generalization. The source code is available at https://github.com/jisoo-o/UniVoiceLite.
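The abstract does not specify the exact form of the Wasserstein regularizer, but the general idea of matching a latent distribution to a reference prior without paired data can be illustrated with a sliced-Wasserstein penalty. The sketch below is a generic, dependency-free illustration, not the paper's implementation: the function names (`wasserstein_1d`, `latent_regularizer`) and the choice of a standard-normal reference distribution are assumptions for the example.

```python
import random

def wasserstein_1d(samples_a, samples_b):
    """Empirical 1-D Wasserstein-1 distance between two equal-size sample
    sets: sort both and average the absolute differences of the matched
    order statistics (the optimal 1-D transport plan)."""
    assert len(samples_a) == len(samples_b)
    a, b = sorted(samples_a), sorted(samples_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def latent_regularizer(latents, n_projections=16, seed=0):
    """Sliced-Wasserstein penalty (hypothetical example): project D-dim
    latent vectors onto random unit directions and compare each 1-D
    projection against samples from a standard-normal reference prior."""
    rng = random.Random(seed)
    dim = len(latents[0])
    penalty = 0.0
    for _ in range(n_projections):
        # Draw a random unit direction in latent space.
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
        # Project every latent vector onto that direction.
        proj = [sum(z_i * v_i for z_i, v_i in zip(z, v)) for z in latents]
        # Reference samples from the prior, projected implicitly (a
        # standard normal is isotropic, so 1-D draws suffice).
        ref = [rng.gauss(0.0, 1.0) for _ in range(len(latents))]
        penalty += wasserstein_1d(proj, ref)
    return penalty / n_projections
```

In a training loop, a term like `loss = reconstruction_loss + lam * latent_regularizer(z_batch)` pulls the latent distribution toward the prior, which is one common way such a regularizer stabilizes an unsupervised latent space.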