🤖 AI Summary
This work addresses the challenge of target speaker extraction in noisy audio-visual scenarios, where existing methods tightly couple speech separation with target selection, often compromising audio fidelity. To overcome this limitation, the authors propose a decoupled multimodal architecture that freezes a pretrained audio-only separation backbone to preserve its acoustic priors, while using visual cues exclusively for target speaker selection. The key innovation is the Latent Steering Matrix (LSM), a lightweight linear transformation in the latent space that redirects features toward the target speaker without relearning the separation process. Evaluated across four mainstream architectures, the approach demonstrates strong generality, substantially improving target-speaker extraction fidelity while maintaining the perceptual quality of the original audio separation backbone.
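The paper does not specify the LSM at code level here; the PyTorch sketch below shows one plausible reading of the idea, assuming the backbone's decoder consumes per-speaker latent features of shape (batch, speakers, features, time). The names `AudioBackbone` and `visual_emb`, and the softmax row-normalization of the steering matrix, are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of the Latent Steering Matrix (LSM) idea, assuming a
# separation backbone whose latents are organized as (B, n_spk, D, T).
# Module/tensor names here are hypothetical, not taken from the paper.
import torch
import torch.nn as nn

class LatentSteeringMatrix(nn.Module):
    """Predicts an n_spk x n_spk mixing matrix from a visual embedding and
    applies it across the speaker axis of the backbone's latent features,
    anchoring the target speaker to a designated channel (channel 0)."""
    def __init__(self, visual_dim: int, n_spk: int = 2):
        super().__init__()
        self.n_spk = n_spk
        # Lightweight head: visual cue -> entries of the steering matrix.
        self.to_matrix = nn.Linear(visual_dim, n_spk * n_spk)

    def forward(self, latents: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
        # latents: (B, n_spk, D, T); visual_emb: (B, visual_dim)
        B = latents.size(0)
        M = self.to_matrix(visual_emb).view(B, self.n_spk, self.n_spk)
        M = M.softmax(dim=-1)  # each row is a convex mixture of speaker channels
        # Re-route speaker channels: out[b, i] = sum_j M[b, i, j] * latents[b, j]
        return torch.einsum("bij,bjdt->bidt", M, latents)

# Training-time usage (sketch): freeze the separation backbone so only the
# LSM (and any visual encoder) receives gradients.
# backbone = AudioBackbone(...)        # pretrained audio-only separator
# for p in backbone.parameters():
#     p.requires_grad_(False)
# lsm = LatentSteeringMatrix(visual_dim=512, n_spk=2)
# z = backbone.encode(mixture)         # (B, n_spk, D, T) latent features
# z = lsm(z, visual_emb)               # target anchored to channel 0
# target_estimate = backbone.decode(z)[:, 0]
```

Because the steering matrix only mixes the backbone's existing speaker channels, the frozen separator's behavior is left intact; under these assumptions, gradients flow only through the LSM's small linear head.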
📝 Abstract
The goal of this paper is to provide a new perspective on audio-visual target speaker extraction (AV-TSE) by decoupling separation from target selection. Conventional AV-TSE systems typically integrate audio and visual features deeply to re-learn the entire separation process, which can impose a fidelity ceiling due to the noisy nature of in-the-wild audio-visual datasets. To address this, we propose Plug-and-Steer, which assigns high-fidelity separation to a frozen audio-only backbone and limits the role of the visual modality strictly to target selection. We introduce the Latent Steering Matrix (LSM), a minimalist linear transformation that re-routes latent features within the backbone to anchor the target speaker to a designated channel. Experiments across four representative architectures show that our method effectively preserves the acoustic priors of diverse backbones, achieving perceptual quality comparable to that of the original backbones. Audio samples are available at: https://plugandsteer.github.io