AV-Master: Dual-Path Comprehensive Perception Makes Better Audio-Visual Question Answering

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual question answering (AVQA) methods lack dynamic adaptability in temporal sampling and modality preference modeling, which hinders their ability to selectively attend to salient audio-visual cues conditioned on the question and limits multi-step reasoning in complex scenarios. To address this, we propose AV-Master, a framework with three components: (1) a question-driven dynamic adaptive focus sampling mechanism for precise localization of critical time intervals; (2) a modality preference-aware module that disentangles and quantifies the independent contributions of the audio and visual modalities; and (3) a dual-path contrastive loss that jointly optimizes cross-modal and cross-temporal consistency and complementarity. Evaluated on four mainstream benchmarks, AV-Master achieves consistent improvements over state-of-the-art methods, with particularly significant gains on questions requiring spatiotemporal reasoning and tight audio-visual coordination, demonstrating the efficacy of dynamic, collaborative representation learning.

📝 Abstract
Audio-Visual Question Answering (AVQA) requires models to effectively utilize both visual and auditory modalities to answer complex and diverse questions about audio-visual scenes. However, existing methods lack sufficient flexibility and dynamic adaptability in temporal sampling and modality preference awareness, making it difficult to focus on key information based on the question. This limits their reasoning capability in complex scenarios. To address these challenges, we propose a novel framework named AV-Master. It enhances the model's ability to extract key information from complex audio-visual scenes with substantial redundant content by dynamically modeling both temporal and modality dimensions. In the temporal dimension, we introduce a dynamic adaptive focus sampling mechanism that progressively focuses on audio-visual segments most relevant to the question, effectively mitigating redundancy and segment fragmentation in traditional sampling methods. In the modality dimension, we propose a preference-aware strategy that models each modality's contribution independently, enabling selective activation of critical features. Furthermore, we introduce a dual-path contrastive loss to reinforce consistency and complementarity across temporal and modality dimensions, guiding the model to learn question-specific cross-modal collaborative representations. Experiments on four large-scale benchmarks show that AV-Master significantly outperforms existing methods, especially in complex reasoning tasks.
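The abstract describes the focus sampling mechanism only at a high level. As a rough illustration of the general idea (not the paper's actual method; the feature dimensions, similarity measure, and top-k selection are all assumptions), a question-conditioned sampler might score each audio-visual segment against the question embedding and keep only the most relevant segments:

```python
import numpy as np

def focus_sample(segment_feats, question_feat, k):
    """Score each segment against the question and keep the top-k.

    segment_feats: (T, d) per-segment audio-visual features (hypothetical)
    question_feat: (d,) question embedding (hypothetical)
    k: number of segments to keep
    """
    # Cosine similarity between the question and each segment.
    seg_norm = segment_feats / np.linalg.norm(segment_feats, axis=1, keepdims=True)
    q_norm = question_feat / np.linalg.norm(question_feat)
    scores = seg_norm @ q_norm                      # (T,) relevance scores
    # Keep the k most question-relevant segments, preserving temporal order.
    top_idx = np.sort(np.argsort(scores)[-k:])
    return top_idx, scores

# Toy example: 6 segments of 4-d features, keep the 3 most relevant.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))
question = feats[2] + 0.1 * rng.normal(size=4)      # question "matches" segment 2
idx, scores = focus_sample(feats, question, k=3)
print(idx)                                          # includes segment 2
```

The paper's mechanism is described as *progressive* (iteratively narrowing the focus), which a single top-k pass like this does not capture; one could approximate that by re-scoring within the selected window over several rounds.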
Problem

Research questions and friction points this paper is trying to address.

Dynamic temporal sampling for key audio-visual segments
Modality preference modeling to activate critical features
Enhancing cross-modal reasoning in complex audio-visual scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic adaptive focus sampling for temporal segments
Preference-aware strategy for modality contribution modeling
Dual-path contrastive loss for cross-modal consistency
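The listed dual-path contrastive loss is not spelled out on this page. One plausible reading, sketched below under assumptions (InfoNCE as the contrastive objective, in-batch negatives, and a simple sum of a cross-modal term and a cross-temporal term), is:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """One-directional InfoNCE: row i of `anchors` should match row i of
    `positives`; all other rows in the batch act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                 # (B, B) similarities
    m = logits.max(axis=1, keepdims=True)            # numerically stable log-softmax
    log_z = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float((log_z - np.diag(logits)).mean())   # -log p(matched pair)

def dual_path_loss(audio, visual, win_a, win_b, lam=1.0):
    """Hypothetical dual-path objective: a cross-modal term (audio vs. visual
    features of the same clip) plus a cross-temporal term (two sampled time
    windows of the same clip), weighted by lam."""
    return info_nce(audio, visual) + lam * info_nce(win_a, win_b)

# Toy check: aligned pairs give a lower loss than shuffled pairs.
rng = np.random.default_rng(1)
feats = rng.normal(size=(8, 16))
aligned = dual_path_loss(feats, feats + 0.01, feats, feats + 0.01)
shuffled = dual_path_loss(feats, feats[::-1], feats, feats[::-1])
print(aligned < shuffled)  # → True
```

The actual loss may differ (e.g. symmetric terms, learned projection heads, or hard-negative mining); this only illustrates how two contrastive paths over the modality and temporal dimensions can be combined.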
Jiayu Zhang
Great Bay University, and Dongguan Key Laboratory for Intelligence and Information Technology
Qilang Ye
Nankai University
Shuo Ye
Huazhong University of Science and Technology
Deep learning · Computer vision · Fine-Grained Image Analysis
Xun Lin
Postdoc, CUHK; PhD, Beihang University
Subtle Visual Computing · Media Security
Zihan Song
Great Bay University, and Dongguan Key Laboratory for Intelligence and Information Technology
Zitong Yu
U.S. Food and Drug Administration
Medical imaging · Deep learning · Machine learning · Image reconstruction