Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation

📅 2025-12-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Real-world speech scenarios often involve concurrent noise corruption and speaker overlap, yet conventional speech enhancement (SE) and speaker separation (SS) are typically modeled independently; existing joint approaches rely heavily on complex supervised training, limiting generalizability and scalability. This paper proposes UniVoiceLite, a lightweight, unsupervised audio-visual unified framework that, for the first time, incorporates Wasserstein distance regularization into audio-visual speaker separation to stabilize the latent space without paired data, enabling simultaneous noise suppression and speaker separation within a single model. Our method integrates lip-motion dynamics, facial identity embeddings, and a streamlined network architecture, supporting end-to-end unsupervised training. Experiments demonstrate state-of-the-art performance in challenging multi-noise, multi-speaker settings, with a minimal parameter count, low inference latency, and strong cross-dataset generalization. The code is publicly available.

📝 Abstract
Speech Enhancement (SE) and Speech Separation (SS) have traditionally been treated as distinct tasks in speech processing. However, real-world audio often involves both background noise and overlapping speakers, motivating the need for a unified solution. While recent approaches have attempted to integrate SE and SS within multi-stage architectures, these approaches typically involve complex, parameter-heavy models and rely on supervised training, limiting scalability and generalization. In this work, we propose UniVoiceLite, a lightweight and unsupervised audio-visual framework that unifies SE and SS within a single model. UniVoiceLite leverages lip motion and facial identity cues to guide speech extraction and employs Wasserstein distance regularization to stabilize the latent space without requiring paired noisy-clean data. Experimental results demonstrate that UniVoiceLite achieves strong performance in both noisy and multi-speaker scenarios, combining efficiency with robust generalization. The source code is available at https://github.com/jisoo-o/UniVoiceLite.
Problem

Research questions and friction points this paper is trying to address.

Unifies speech enhancement and separation tasks
Reduces model complexity and parameter reliance
Enables unsupervised learning without paired data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight unsupervised audio-visual framework unifies speech tasks
Uses lip motion and facial identity cues to guide extraction
Employs Wasserstein distance regularization for stable latent space
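The paper's exact regularizer is not given here, so as an illustrative sketch only: for one-dimensional distributions, the p-Wasserstein distance between equal-size empirical samples reduces to matching sorted values, which makes a coordinate-wise "pull the latent batch toward a prior" term cheap to compute. All names below (`wasserstein_1d`, `latent_wasserstein_reg`, the standard-normal prior) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """Empirical p-Wasserstein distance between two equal-size 1-D samples.

    In 1-D the optimal transport plan simply matches sorted samples, so the
    distance is the p-th root of the mean p-th power of sorted differences.
    """
    x, y = np.sort(x), np.sort(y)
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

def latent_wasserstein_reg(latents, rng):
    """Hypothetical regularizer: average 1-D Wasserstein distance between
    each latent dimension of a batch and fresh standard-normal prior samples
    (a coordinate-wise 'sliced' approximation)."""
    prior = rng.standard_normal(latents.shape)
    return float(np.mean([wasserstein_1d(latents[:, d], prior[:, d])
                          for d in range(latents.shape[1])]))

rng = np.random.default_rng(0)
z = rng.standard_normal((256, 8))   # a batch of latent codes near the prior
reg = latent_wasserstein_reg(z, rng)  # small value: batch matches the prior
```

In an unsupervised setup like the one described, a term of this kind can be added to the training loss so the latent space stays close to a fixed prior without any paired noisy-clean supervision.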
Jisoo Park
Department of Artificial Intelligence, Chung-Ang University, South Korea
Seonghak Lee
Department of Computer Science and Engineering, Chung-Ang University, South Korea
Guisik Kim
Korea Electronics Technology Institute, South Korea
Taewoo Kim
Korea Electronics Technology Institute, South Korea
Junseok Kwon
Chung-Ang University
Computer Vision, Machine Learning