🤖 AI Summary
This work addresses the challenge of target speaker extraction in noisy audio-visual scenarios, where existing methods tightly couple speech separation with target selection, often compromising audio fidelity. To overcome this limitation, the authors propose a decoupled multimodal architecture that freezes a pretrained audio-only separation backbone to preserve its acoustic priors, while using visual cues exclusively for target speaker selection. The key innovation is the Latent Steering Matrix (LSM), a lightweight linear transformation in the latent space that redirects features toward the target speaker without relearning the separation process. Evaluated across four mainstream architectures, the approach demonstrates strong generality, substantially improving target-speaker extraction fidelity while maintaining the perceptual quality of the original audio separation backbone.
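The paper does not specify the LSM at code level here; the PyTorch sketch below shows one plausible reading of the idea, assuming the backbone's decoder consumes per-speaker latent features of shape (batch, speakers, features, time). The names `AudioBackbone` and `visual_emb`, and the softmax row-normalization of the steering matrix, are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of the Latent Steering Matrix (LSM) idea, assuming a
# separation backbone whose latents are organized as (B, n_spk, D, T).
# Module/tensor names here are hypothetical, not taken from the paper.
import torch
import torch.nn as nn

class LatentSteeringMatrix(nn.Module):
    """Predicts an n_spk x n_spk mixing matrix from a visual embedding and
    applies it across the speaker axis of the backbone's latent features,
    anchoring the target speaker to a designated channel (channel 0)."""
    def __init__(self, visual_dim: int, n_spk: int = 2):
        super().__init__()
        self.n_spk = n_spk
        # Lightweight head: visual cue -> entries of the steering matrix.
        self.to_matrix = nn.Linear(visual_dim, n_spk * n_spk)

    def forward(self, latents: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
        # latents: (B, n_spk, D, T); visual_emb: (B, visual_dim)
        B = latents.size(0)
        M = self.to_matrix(visual_emb).view(B, self.n_spk, self.n_spk)
        M = M.softmax(dim=-1)  # each row is a convex mixture of speaker channels
        # Re-route speaker channels: out[b, i] = sum_j M[b, i, j] * latents[b, j]
        return torch.einsum("bij,bjdt->bidt", M, latents)

# Training-time usage (sketch): freeze the separation backbone so only the
# LSM (and any visual encoder) receives gradients.
# backbone = AudioBackbone(...)        # pretrained audio-only separator
# for p in backbone.parameters():
#     p.requires_grad_(False)
# lsm = LatentSteeringMatrix(visual_dim=512, n_spk=2)
# z = backbone.encode(mixture)         # (B, n_spk, D, T) latent features
# z = lsm(z, visual_emb)               # target anchored to channel 0
# target_estimate = backbone.decode(z)[:, 0]
```

Because the steering matrix only mixes the backbone's existing speaker channels, the frozen separator's behavior is left intact; under these assumptions, gradients flow only through the LSM's small linear head.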
📝 Abstract
The goal of this paper is to provide a new perspective on audio-visual target speaker extraction (AV-TSE) by decoupling separation from target selection. Conventional AV-TSE systems typically integrate audio and visual features deeply to re-learn the entire separation process, which can impose a fidelity ceiling due to the noisy nature of in-the-wild audio-visual datasets. To address this, we propose Plug-and-Steer, which assigns high-fidelity separation to a frozen audio-only backbone and limits the role of the visual modality strictly to target selection. We introduce the Latent Steering Matrix (LSM), a minimalist linear transformation that re-routes latent features within the backbone to anchor the target speaker to a designated channel. Experiments across four representative architectures show that our method effectively preserves the acoustic priors of diverse backbones, achieving perceptual quality comparable to that of the original backbones. Audio samples are available at: https://plugandsteer.github.io