🤖 AI Summary
Existing audio-visual target speech extraction (AV-TSE) methods over-rely on visual cues, leaving them vulnerable to visual degradation, unseen languages, target speaker switches, and additional interfering speakers. To address these challenges, this paper introduces a cross-modal language-guided framework that, for the first time, integrates multi-level linguistic knowledge from large language models (LLMs) into AV-TSE, covering output constraints, intermediate predictions, and input priors. The framework leverages RoBERTa and Qwen3 to model linguistic structure and dynamically fuses it with audio-visual representations within two mainstream AV-TSE backbones, enabling linguistically informed decoding. Experiments demonstrate substantial improvements in robustness and generalization under visual corruption, cross-lingual, cross-domain, and multi-speaker conditions, and the method achieves state-of-the-art performance across multiple standard benchmarks.
📝 Abstract
Audio-visual target speaker extraction (AV-TSE) models primarily rely on visual cues from the target speaker. However, humans also leverage linguistic knowledge, such as syntactic constraints, next-word prediction, and prior knowledge of the conversation, to extract target speech. Inspired by this observation, we propose ELEGANCE, a novel framework that incorporates linguistic knowledge from large language models (LLMs) into AV-TSE models through three distinct guidance strategies: output linguistic constraints, intermediate linguistic prediction, and input linguistic prior. Comprehensive experiments with RoBERTa, Qwen3-0.6B, and Qwen3-4B on two AV-TSE backbones demonstrate the effectiveness of our approach. Significant improvements are observed in challenging scenarios, including impaired visual cues, unseen languages, target speaker switches, more interfering speakers, and out-of-domain test sets. Demo page: https://alexwxwu.github.io/ELEGANCE/.
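To make the "input linguistic prior" idea concrete, here is a minimal toy sketch of one way an LLM-derived text embedding could be projected into the same space as audio-visual features and fused via a per-frame gate before decoding. This is a hypothetical illustration, not the paper's actual architecture: all dimensions, weight matrices, and the gating scheme are invented for demonstration.

```python
import numpy as np

# Toy fusion of a linguistic prior with audio-visual (AV) features.
# Hypothetical sketch only; shapes and the gated-sum design are assumptions.
rng = np.random.default_rng(0)

T, D_AV, D_LLM = 50, 256, 768          # frames, AV feature dim, LLM embedding dim

av_feats = rng.standard_normal((T, D_AV))   # per-frame fused audio-visual features
llm_emb = rng.standard_normal((D_LLM,))     # utterance-level linguistic prior from an LLM

W_proj = rng.standard_normal((D_LLM, D_AV)) * 0.01  # projects LLM space -> AV space
w_gate = rng.standard_normal((D_AV,)) * 0.01        # scores how much prior each frame uses

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

ling = llm_emb @ W_proj                       # (D_AV,) projected linguistic prior
gate = sigmoid(av_feats @ w_gate)             # (T,) per-frame gate in (0, 1)
fused = av_feats + gate[:, None] * ling[None, :]  # linguistically guided features

print(fused.shape)  # (50, 256)
```

In a trained model the projection and gate would be learned jointly with the extraction backbone, so frames where visual evidence is weak could lean more heavily on the linguistic prior.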