ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction

📅 2025-11-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual target speech extraction (AV-TSE) methods over-rely on visual cues, rendering them vulnerable to visual degradation, unseen languages, speaker switches, and multiple interfering speakers. To address these challenges, this paper introduces a cross-modal language-guided framework that, for the first time, integrates multi-level linguistic knowledge from large language models (LLMs)—including output constraints, intermediate predictions, and input priors—into AV-TSE. The authors leverage RoBERTa and Qwen3 to model linguistic structure and dynamically fuse it with audio-visual representations within two mainstream AV-TSE backbones, enabling linguistically informed decoding. Experiments demonstrate substantial improvements in robustness and generalization under visual corruption, cross-lingual, cross-domain, and multi-speaker conditions, with state-of-the-art performance across multiple standard benchmarks.

📝 Abstract
Audio-visual target speaker extraction (AV-TSE) models primarily rely on visual cues from the target speaker. However, humans also leverage linguistic knowledge, such as syntactic constraints, next-word prediction, and prior knowledge of the conversation, to extract target speech. Inspired by this observation, we propose ELEGANCE, a novel framework that incorporates linguistic knowledge from large language models (LLMs) into AV-TSE models through three distinct guidance strategies: output linguistic constraints, intermediate linguistic prediction, and input linguistic prior. Comprehensive experiments with RoBERTa, Qwen3-0.6B, and Qwen3-4B on two AV-TSE backbones demonstrate the effectiveness of our approach. Significant improvements are observed in challenging scenarios, including impaired visual cues, unseen languages, target speaker switches, increased numbers of interfering speakers, and an out-of-domain test set. Demo page: https://alexwxwu.github.io/ELEGANCE/.
Problem

Research questions and friction points this paper is trying to address.

Incorporating linguistic knowledge from LLMs into audio-visual speech extraction
Addressing limitations of relying solely on visual cues for target speech extraction
Improving performance in challenging scenarios like impaired visual cues and unseen languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates linguistic knowledge from LLMs into AV-TSE models
Uses output constraints and intermediate prediction strategies
Leverages input linguistic prior for enhanced speech extraction
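The paper does not spell out its fusion mechanism in this summary, but the "input linguistic prior" strategy can be illustrated with a minimal sketch: an utterance-level text embedding from an LLM encoder is projected into the audio-visual feature space and fused frame-by-frame through a sigmoid gate. All names, shapes, and the gated-additive fusion rule below are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def fuse_linguistic_prior(av_feats, text_emb, W_t, W_g):
    """Hypothetical gated fusion of a linguistic prior into AV-TSE features.

    av_feats: (T, D) frame-level audio-visual features
    text_emb: (E,)  utterance-level embedding from an LLM text encoder
    W_t:      (E, D) projection of the text embedding into feature space
    W_g:      (2*D, D) gate weights (both matrices are stand-ins for
              learned parameters; here they are just arrays)
    Returns fused features of shape (T, D).
    """
    t = text_emb @ W_t                                  # (D,) projected prior
    t_tiled = np.tile(t, (av_feats.shape[0], 1))        # (T, D) broadcast over frames
    gate_in = np.concatenate([av_feats, t_tiled], axis=1)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ W_g)))       # sigmoid gate per frame
    return av_feats + gate * t_tiled                    # gated additive fusion

# Toy usage with random "learned" parameters
rng = np.random.default_rng(0)
T, D, E = 50, 256, 768                                  # frames, feature dim, LLM dim
fused = fuse_linguistic_prior(
    rng.standard_normal((T, D)), rng.standard_normal(E),
    rng.standard_normal((E, D)) * 0.01, rng.standard_normal((2 * D, D)) * 0.01,
)
print(fused.shape)  # (50, 256)
```

In this sketch the gate lets each frame decide how much of the linguistic prior to admit, which loosely mirrors the idea of conditioning the separator on language structure without overriding the visual stream.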