TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning

📅 2026-05-12

🤖 AI Summary
This work addresses insufficient explicit semantic alignment between audio and visual modalities by proposing a parameter-efficient fine-tuning framework that uses text as a semantic anchor. Building on frozen pretrained audio and visual encoders, the approach introduces a Text-Bridged Audio-Visual Adapter (TB-AVA) and, at its core, a Gated Semantic Modulation (GSM) mechanism that modulates feature channels according to text-inferred semantic relevance. It is presented as the first method to employ text as a cross-modal semantic bridge for efficient audio-visual feature interaction. Evaluated on multiple benchmarks, including AVE, AVS, and AVVP, the framework achieves state-of-the-art performance, supporting the effectiveness and generalizability of text-guided learning for parameter-efficient multimodal representation learning.
📝 Abstract
Audio-visual understanding requires effective alignment between heterogeneous modalities, yet cross-modal correspondence remains challenging when temporally aligned audio and visual signals lack clear semantic correspondence. We propose to use text as a semantic anchor for audio-visual representation learning. To this end, we introduce a parameter-efficient adaptation framework built on frozen audio and visual encoders, centered on the Text-Bridged Audio-Visual Adapter (TB-AVA), which enables text-mediated interaction between audio and visual streams. At the core of TB-AVA, Gated Semantic Modulation (GSM) selectively modulates feature channels based on text-inferred semantic relevance. We evaluate the proposed approach on multiple benchmarks, including AVE, AVS, and AVVP, where it achieves state-of-the-art performance, demonstrating that text is an effective semantic anchor for parameter-efficient fine-tuning (PEFT) in audio-visual learning.
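The abstract does not give the exact formulation of GSM, so the following is only a minimal sketch of the general idea of gating feature channels with a text-derived signal. Everything here is an assumption made for illustration: the function names (channel_gates, gated_modulation), the single linear projection of the text embedding, and the sigmoid gate are hypothetical choices, not the authors' implementation.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_gates(text_emb, weight, bias):
    """Compute one gate per feature channel from a text embedding.

    Assumed form (not taken from the paper): gate_c = sigmoid(w_c . text + b_c),
    where `weight` holds one row w_c per channel. In a PEFT setting these
    small projection weights would be trainable while the encoders stay frozen.
    """
    return [
        sigmoid(sum(w * t for w, t in zip(row, text_emb)) + b)
        for row, b in zip(weight, bias)
    ]


def gated_modulation(features, gates):
    """Scale every channel of every token by its gate.

    `features` is a list of tokens, each a list of C channel values
    (audio or visual features from a frozen encoder).
    """
    return [[x * g for x, g in zip(token, gates)] for token in features]


# Toy example: a 2-dimensional text embedding gating 3 feature channels.
text_emb = [1.0, 0.0]
weight = [[4.0, 0.0], [-4.0, 0.0], [0.0, 0.0]]  # hypothetical gate weights
bias = [0.0, 0.0, 0.0]

gates = channel_gates(text_emb, weight, bias)
features = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
modulated = gated_modulation(features, gates)
```

In this toy setup the first channel is passed almost unchanged, the second is nearly suppressed, and the third is halved, which is the selective behavior the abstract attributes to text-inferred semantic relevance. A real adapter would apply such gates to tensors inside the frozen encoder layers rather than to Python lists.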
Problem

Research questions and friction points this paper is trying to address.

audio-visual alignment
cross-modal correspondence
semantic correspondence
parameter-efficient fine-tuning
multimodal learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

text-bridged adaptation
parameter-efficient fine-tuning
audio-visual alignment
gated semantic modulation
multimodal representation learning