$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction

📅 2025-04-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing audio-visual target speaker extraction (AV-TSE) methods rely heavily on local acoustic modeling, leading to inconsistent speech reconstruction, insufficient suppression of interfering speakers, and segment-level quality fluctuations. To address these limitations, we propose the model-agnostic Mask-And-Recover (MAR) framework, which captures long-range dependencies through cross-modal and intra-modal global contextual modeling. We further introduce a Fine-grained Confidence Score (FCS) model that identifies low-quality speech segments and guides training to refine them. Together, MAR and FCS form a model-agnostic training paradigm that leaves the underlying extraction backbone unchanged. Evaluated on VoxCeleb2, the approach consistently improves SI-SNR, PESQ, and STOI across six mainstream backbone networks, with better inter-segment coherence and stronger suppression of interfering speakers.
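
For intuition, the following is a minimal sketch of the mask-and-recover idea, assuming a PyTorch backbone that exposes frame-level features. The function name, the span-masking scheme, the recovery network, and the L1 reconstruction target are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mask_and_recover_loss(features, recover_net, mask_ratio=0.3, span=10):
    """Illustrative mask-and-recover objective (not the paper's code):
    hide contiguous spans of frame-level features and ask a recovery
    network to reconstruct them from the surrounding (global) context."""
    B, T, D = features.shape
    mask = torch.zeros(B, T, dtype=torch.bool, device=features.device)
    n_spans = max(1, int(mask_ratio * T / span))
    for b in range(B):
        starts = torch.randint(0, max(1, T - span), (n_spans,))
        for s in starts.tolist():
            mask[b, s:s + span] = True
    masked = features.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out the hidden frames
    recovered = recover_net(masked)                          # contextual inference over (B, T, D)
    return F.l1_loss(recovered[mask], features[mask])        # penalize only the masked frames
```

In practice, such a recovery term would be added to the backbone's usual extraction loss so the model learns to exploit cross-modal and intra-modal context rather than local cues alone.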

📝 Abstract
Audio-Visual Target Speaker Extraction (AV-TSE) aims to mimic the human ability to enhance auditory perception using visual cues. Although numerous models have been proposed recently, most of them estimate target signals by primarily relying on local dependencies within acoustic features, underutilizing the human-like capacity to infer unclear parts of speech through contextual information. This limitation results in not only suboptimal performance but also inconsistent extraction quality across the utterance, with some segments exhibiting poor quality or inadequate suppression of interfering speakers. To close this gap, we propose a model-agnostic strategy called the Mask-And-Recover (MAR). It integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. Additionally, to better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model to assess extraction quality and guide extraction modules to emphasize improvement on low-quality segments. To validate the effectiveness of our proposed model-agnostic training paradigm, six popular AV-TSE backbones were adopted for evaluation on the VoxCeleb2 dataset, demonstrating consistent performance improvements across various metrics.
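
To make the confidence-guided training idea concrete, here is a hedged sketch of segment-level quality scoring in PyTorch: per-segment SI-SNR between the extracted and reference waveforms is used to re-weight the loss toward poorly extracted segments. The segment length, the softmax weighting, and the function names are assumptions for illustration; the paper's FCS is a learned scoring model, and this sketch only mirrors its role during training.

```python
import torch

def segment_si_snr(est, ref, seg_len=4000, eps=1e-8):
    """Per-segment SI-SNR in dB for (batch, samples) waveforms; trailing
    samples that do not fill a full segment are dropped for simplicity."""
    B, N = est.shape
    S = N // seg_len
    est = est[:, :S * seg_len].reshape(B, S, seg_len)
    ref = ref[:, :S * seg_len].reshape(B, S, seg_len)
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps) * ref
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)

def confidence_weighted_loss(est, ref, seg_len=4000):
    """Weight the training objective toward low-quality segments:
    segments with low SI-SNR (low 'confidence') receive larger weights."""
    si_snr = segment_si_snr(est, ref, seg_len)            # (B, S), higher is better
    weights = torch.softmax(-si_snr.detach(), dim=-1)     # emphasize the worst segments
    return -(weights * si_snr).sum(dim=-1).mean()         # weighted negative SI-SNR
```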
Problem

Research questions and friction points this paper is trying to address.

Most AV-TSE models estimate the target signal from local dependencies within acoustic features, underusing contextual information to infer unclear parts of speech
Extraction quality is inconsistent across the utterance, with some segments showing poor quality or inadequate suppression of interfering speakers
Standard training does not specifically target the challenging, low-quality segments within each sample
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mask-And-Recover (MAR) integrates inter- and intra-modality contextual correlations for global inference within extraction modules
Fine-grained Confidence Score (FCS) assesses segment-level extraction quality and steers training toward low-quality segments
Model-agnostic training paradigm yields consistent gains across six popular AV-TSE backbones on VoxCeleb2 (see the sketch below)
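
As referenced in the last item above, this is a hypothetical sketch of how a model-agnostic paradigm can wrap an arbitrary AV-TSE backbone: the contextual and confidence objectives are simply added to the backbone's usual extraction loss, so swapping backbones requires no architectural changes. The class name, constructor arguments, and additive loss combination are assumptions, not the paper's interface.

```python
import torch.nn as nn

class ModelAgnosticTrainer(nn.Module):
    """Hypothetical wrapper: auxiliary objectives (e.g. mask-and-recover,
    confidence-weighted loss) attach to any AV-TSE backbone unchanged."""
    def __init__(self, backbone, extraction_loss, aux_losses, aux_weight=0.1):
        super().__init__()
        self.backbone = backbone            # any AV-TSE model: (mixture, visual) -> estimate
        self.extraction_loss = extraction_loss
        self.aux_losses = aux_losses        # callables: (estimate, target) -> scalar loss
        self.aux_weight = aux_weight

    def forward(self, mixture, visual, target):
        estimate = self.backbone(mixture, visual)
        loss = self.extraction_loss(estimate, target)
        for aux in self.aux_losses:
            loss = loss + self.aux_weight * aux(estimate, target)
        return estimate, loss
```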