Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation

📅 2022-06-01
🏛️ Computer Vision and Pattern Recognition
📈 Citations: 23
Influential: 3
🤖 AI Summary
This work addresses speech separation and enhancement under challenging conditions involving multi-speaker overlap and background noise. We propose an end-to-end multimodal approach operating in the raw waveform domain that fuses synchronous or asynchronous visual, auditory, and textual cues. Our key contributions are: (1) the first integration of textual semantics, either alone or jointly with visual information, as a conditioning signal for speech separation; (2) a cross-modal fusion framework robust to audio-visual temporal misalignment; and (3) a unified Transformer-based architecture modeling multimodal temporal dynamics, coupled with a time-domain convolutional encoder-decoder and fine-grained feature alignment. Evaluated on the LRS2 and LRS3 benchmarks, our method achieves state-of-the-art performance, significantly improving separation quality and speech intelligibility, particularly in noisy environments and under lip-audio desynchronization.

📝 Abstract
The goal of this paper is speech separation and enhancement in multi-speaker and noisy environments using a combination of different modalities. Previous works have shown good performance when conditioning on temporal or static visual evidence such as synchronised lip movements or face identity. In this paper, we present a unified framework for multi-modal speech separation and enhancement based on synchronous or asynchronous cues. To that end we make the following contributions: (i) we design a modern Transformer-based architecture tailored to fuse different modalities to solve the speech separation task in the raw waveform domain; (ii) we propose conditioning on the textual content of a sentence alone or in combination with visual information; (iii) we demonstrate the robustness of our model to audio-visual synchronisation offsets; and, (iv) we obtain state-of-the-art performance on the well-established benchmark datasets LRS2 and LRS3.
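The pipeline the abstract outlines (a time-domain convolutional encoder over the raw waveform, attention-based fusion of visual and textual conditioning tokens, mask-based separation, and a transposed-convolution decoder) can be sketched roughly as follows. This is an illustrative toy with random weights, not the authors' implementation: all dimensions, token counts, and the single-head attention standing in for the full Transformer are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_encode(wav, w, stride):
    """Strided 1-D conv: raw waveform -> frame features [T, d] (ReLU)."""
    k, d = w.shape
    T = (len(wav) - k) // stride + 1
    frames = np.stack([wav[t * stride : t * stride + k] for t in range(T)])
    return np.maximum(frames @ w, 0.0)

def cross_attention(q, kv):
    """Audio frames (queries) attend over conditioning tokens (keys/values)."""
    scores = q @ kv.T / np.sqrt(q.shape[1])
    a = np.exp(scores - scores.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    return a @ kv

d, k, stride = 16, 32, 16                 # assumed toy dimensions
wav = rng.standard_normal(16000)          # 1 s of a 16 kHz two-speaker mixture
W_enc = rng.standard_normal((k, d)) * 0.1

feats = conv1d_encode(wav, W_enc, stride)  # [T, d] audio features
vis = rng.standard_normal((25, d))         # e.g. lip-motion tokens (assumed)
txt = rng.standard_normal((8, d))          # e.g. text/phoneme tokens (assumed)
cond = np.concatenate([vis, txt])          # visual + textual conditioning

ctx = cross_attention(feats, cond)         # fused multimodal context, [T, d]
W_mask = rng.standard_normal((d, d)) * 0.1
mask = 1.0 / (1.0 + np.exp(-(feats + ctx) @ W_mask))  # sigmoid mask in (0, 1)
masked = feats * mask                      # keep only the target speaker

# Decode: transposed conv (overlap-add) back to waveform length.
out = np.zeros(len(wav))
for t, f in enumerate(masked):
    out[t * stride : t * stride + k] += f @ W_enc.T

print(feats.shape, out.shape)              # → (999, 16) (16000,)
```

The mask-and-decode step mirrors time-domain separators such as Conv-TasNet; the real model replaces the single attention call with a stack of Transformer layers and learns all weights end-to-end.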
Problem

Research questions and friction points this paper is trying to address.

Speech Separation
Single Speaker Recognition
Background Noise Reduction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-Visual Separation
Content Understanding
Speech Recognition Accuracy