Question-Aware Gaussian Experts for Audio-Visual Question Answering

📅 2025-03-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing audio-visual question answering (AVQA) methods leverage question information only implicitly and rely on uniform frame sampling, often missing critical temporal cues. Top-K selection improves localization, but its discrete modeling neglects fine-grained temporal dynamics. To address these limitations, we propose a question-aware Gaussian Mixture of Experts (MoE) mechanism and a progressive temporal refinement framework, the first to explicitly inject question semantics into continuous-time modeling. Specifically, we align video frame distributions via Gaussian kernels conditioned on question embeddings, enabling adaptive weighting of contiguous or non-contiguous key frames through question-conditioned MoE routing. We further introduce multi-stage cross-modal alignment and explicit question embedding integration. Evaluated on multiple AVQA benchmarks, our approach achieves state-of-the-art performance, significantly improving both temporal localization accuracy and cross-modal reasoning consistency.

📝 Abstract
Audio-Visual Question Answering (AVQA) requires not only question-based multimodal reasoning but also precise temporal grounding to capture subtle dynamics for accurate prediction. However, existing methods mainly use question information implicitly, limiting focus on question-specific details. Furthermore, most studies rely on uniform frame sampling, which can miss key question-relevant frames. Although recent Top-K frame selection methods aim to address this, their discrete nature still overlooks fine-grained temporal details. This paper proposes QA-TIGER, a novel framework that explicitly incorporates question information and models continuous temporal dynamics. Our key idea is to use Gaussian-based modeling to adaptively focus on both consecutive and non-consecutive frames based on the question, while explicitly injecting question information and applying progressive refinement. We leverage a Mixture of Experts (MoE) to flexibly implement multiple Gaussian models, activating temporal experts specifically tailored to the question. Extensive experiments on multiple AVQA benchmarks show that QA-TIGER consistently achieves state-of-the-art performance. Code is available at https://github.com/AIM-SKKU/QA-TIGER
Problem

Research questions and friction points this paper is trying to address.

Explicitly incorporates question information for AVQA.
Models continuous temporal dynamics using Gaussian-based modeling.
Improves focus on question-relevant frames with adaptive sampling.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gaussian-based modeling for temporal dynamics
Mixture of Experts for flexible frame selection
Progressive refinement with explicit question information
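
The idea of question-conditioned Gaussian experts can be illustrated with a small NumPy sketch. This is not the QA-TIGER implementation. It is a minimal illustration that assumes each expert predicts a Gaussian center and width over normalized frame positions from the question embedding, and that a dense softmax gate mixes the experts. All names (`gaussian_expert_weights`, `pool_frames`, `W_mu`, `W_sigma`, `W_gate`, `min_sigma`) and the random projection matrices are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def gaussian_expert_weights(question, W_mu, W_sigma, W_gate, num_frames, min_sigma=0.05):
    """Mix E question-conditioned Gaussians into one weight per frame.

    question: (d,) question embedding; W_*: (E, d) hypothetical projections.
    Returns a (num_frames,) array of non-negative weights summing to 1.
    """
    t = np.linspace(0.0, 1.0, num_frames)               # normalized frame positions
    mu = 1.0 / (1.0 + np.exp(-(W_mu @ question)))       # centers in (0, 1), shape (E,)
    sigma = min_sigma + np.logaddexp(0.0, W_sigma @ question)  # softplus widths, shape (E,)
    gate = softmax(W_gate @ question)                   # expert mixing weights, shape (E,)
    # One Gaussian curve per expert over the frame axis, shape (E, num_frames).
    curves = np.exp(-0.5 * ((t[None, :] - mu[:, None]) / sigma[:, None]) ** 2)
    curves = curves / curves.sum(axis=1, keepdims=True) # each expert sums to 1
    return gate @ curves                                # convex mixture over experts

def pool_frames(frame_feats, weights):
    """Weighted sum of per-frame features: (T, c) and (T,) -> (c,)."""
    return weights @ frame_feats

# Toy usage with random stand-ins for learned parameters and features.
rng = np.random.default_rng(0)
d, E, T, c = 8, 4, 16, 32
q = rng.normal(size=d)
w = gaussian_expert_weights(
    q, rng.normal(size=(E, d)), rng.normal(size=(E, d)), rng.normal(size=(E, d)), T
)
pooled = pool_frames(rng.normal(size=(T, c)), w)
```

Because each expert places its own Gaussian, the mixture can emphasize one contiguous span of frames or several separated spans, while the weights stay continuous over time rather than being a hard Top-K selection.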