Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation

📅 2025-12-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Real-world speech scenarios often involve concurrent noise corruption and speaker overlap, yet conventional speech enhancement (SE) and speaker separation (SS) are typically modeled independently; existing joint approaches rely heavily on complex supervised training, limiting generalizability and scalability. This paper proposes UniVoiceLite, a lightweight, unsupervised audio-visual unified framework that, for the first time, incorporates Wasserstein distance regularization into audio-visual speaker separation to stabilize the latent space without paired data, enabling simultaneous noise suppression and speaker separation within a single model. Our method integrates lip-motion dynamics, facial identity embeddings, and a streamlined network architecture, supporting end-to-end unsupervised training. Experiments demonstrate state-of-the-art performance in challenging multi-noise, multi-speaker settings, with a minimal parameter count, low inference latency, and strong cross-dataset generalization. The code is publicly available.

📝 Abstract
Speech Enhancement (SE) and Speech Separation (SS) have traditionally been treated as distinct tasks in speech processing. However, real-world audio often involves both background noise and overlapping speakers, motivating the need for a unified solution. While recent approaches have attempted to integrate SE and SS within multi-stage architectures, these approaches typically involve complex, parameter-heavy models and rely on supervised training, limiting scalability and generalization. In this work, we propose UniVoiceLite, a lightweight and unsupervised audio-visual framework that unifies SE and SS within a single model. UniVoiceLite leverages lip motion and facial identity cues to guide speech extraction and employs Wasserstein distance regularization to stabilize the latent space without requiring paired noisy-clean data. Experimental results demonstrate that UniVoiceLite achieves strong performance in both noisy and multi-speaker scenarios, combining efficiency with robust generalization. The source code is available at https://github.com/jisoo-o/UniVoiceLite.
Problem

Research questions and friction points this paper is trying to address.

Unifies speech enhancement and separation tasks
Reduces model complexity and parameter reliance
Enables unsupervised learning without paired data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight unsupervised audio-visual framework unifies speech tasks
Uses lip motion and facial identity cues to guide extraction
Employs Wasserstein distance regularization for stable latent space
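The paper's exact regularizer is not given here, so as an illustrative sketch only: for one-dimensional distributions, the p-Wasserstein distance between equal-size empirical samples reduces to matching sorted values, which makes a coordinate-wise "pull the latent batch toward a prior" term cheap to compute. All names below (`wasserstein_1d`, `latent_wasserstein_reg`, the standard-normal prior) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """Empirical p-Wasserstein distance between two equal-size 1-D samples.

    In 1-D the optimal transport plan simply matches sorted samples, so the
    distance is the p-th root of the mean p-th power of sorted differences.
    """
    x, y = np.sort(x), np.sort(y)
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

def latent_wasserstein_reg(latents, rng):
    """Hypothetical regularizer: average 1-D Wasserstein distance between
    each latent dimension of a batch and fresh standard-normal prior samples
    (a coordinate-wise 'sliced' approximation)."""
    prior = rng.standard_normal(latents.shape)
    return float(np.mean([wasserstein_1d(latents[:, d], prior[:, d])
                          for d in range(latents.shape[1])]))

rng = np.random.default_rng(0)
z = rng.standard_normal((256, 8))   # a batch of latent codes near the prior
reg = latent_wasserstein_reg(z, rng)  # small value: batch matches the prior
```

In an unsupervised setup like the one described, a term of this kind can be added to the training loss so the latent space stays close to a fixed prior without any paired noisy-clean supervision.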
Jisoo Park
Department of Artificial Intelligence, Chung-Ang University, South Korea
Seonghak Lee
Department of Computer Science and Engineering, Chung-Ang University, South Korea
Guisik Kim
Korea Electronics Technology Institute, South Korea
Taewoo Kim
Korea Electronics Technology Institute, South Korea
Junseok Kwon
Chung-Ang University
Computer Vision, Machine Learning