AI Summary
Existing audio-visual question answering (AVQA) methods use question information only implicitly and rely on uniform frame sampling, often missing critical temporal cues. Top-K frame selection improves localization, but its discrete modeling neglects fine-grained temporal dynamics. To address these limitations, we propose a question-aware Gaussian Mixture of Experts (MoE) mechanism and a progressive temporal refinement framework, for the first time explicitly injecting question semantics into continuous-time modeling. Specifically, we align video frame distributions via Gaussian kernels conditioned on question embeddings, enabling adaptive weighting of contiguous or non-contiguous key frames through question-conditioned MoE routing. We further introduce multi-stage cross-modal alignment and explicit question embedding integration. Evaluated on multiple AVQA benchmarks, our approach achieves state-of-the-art performance, significantly improving both temporal localization accuracy and cross-modal reasoning consistency.
Abstract
Audio-Visual Question Answering (AVQA) requires not only question-based multimodal reasoning but also precise temporal grounding to capture subtle dynamics for accurate prediction. However, existing methods mainly use question information implicitly, limiting focus on question-specific details. Furthermore, most studies rely on uniform frame sampling, which can miss key question-relevant frames. Although recent Top-K frame selection methods aim to address this, their discrete nature still overlooks fine-grained temporal details. This paper proposes QA-TIGER, a novel framework that explicitly incorporates question information and models continuous temporal dynamics. Our key idea is to use Gaussian-based modeling to adaptively focus on both consecutive and non-consecutive frames based on the question, while explicitly injecting question information and applying progressive refinement. We leverage a Mixture of Experts (MoE) to flexibly implement multiple Gaussian models, activating temporal experts specifically tailored to the question. Extensive experiments on multiple AVQA benchmarks show that QA-TIGER consistently achieves state-of-the-art performance. Code is available at https://github.com/AIM-SKKU/QA-TIGER
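To make the idea of question-conditioned Gaussian weighting concrete, the following is a minimal NumPy sketch, not the QA-TIGER implementation (see the repository above for that). It assumes a single question embedding and three hypothetical projection matrices (`W_mu`, `W_sigma`, `W_router`) that map it to per-expert Gaussian centers, widths, and routing weights. The function name `gaussian_moe_weights`, the dense softmax routing, and the random inputs are all illustrative assumptions. Progressive refinement and cross-modal alignment are omitted.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def gaussian_moe_weights(question_emb, num_frames, W_mu, W_sigma, W_router,
                         min_sigma=0.05):
    """Sketch: turn a question embedding into soft per-frame weights.

    All parameters are hypothetical stand-ins for learned layers:
    W_mu, W_sigma, W_router each have shape [dim, num_experts].
    """
    # Normalized frame positions in [0, 1].
    t = np.linspace(0.0, 1.0, num_frames)
    # Each expert predicts a Gaussian center in (0, 1) via a sigmoid ...
    mu = 1.0 / (1.0 + np.exp(-(question_emb @ W_mu)))
    # ... and a positive width via a softplus (logaddexp avoids overflow).
    sigma = min_sigma + np.logaddexp(0.0, question_emb @ W_sigma)
    # Question-conditioned routing weights over experts (dense, sums to 1).
    gate = softmax(question_emb @ W_router)
    # One Gaussian curve per expert, normalized over frames: [experts, frames].
    curves = np.exp(-0.5 * ((t[None, :] - mu[:, None]) / sigma[:, None]) ** 2)
    curves = curves / curves.sum(axis=1, keepdims=True)
    # Mixture of experts: experts centered far apart can jointly cover
    # non-consecutive segments, while each curve stays smooth in time.
    weights = gate @ curves
    return weights, gate, mu, sigma

# Usage with random stand-in data (illustrative only).
rng = np.random.default_rng(0)
dim, num_experts, num_frames = 16, 4, 60
q = rng.normal(size=dim)
W_mu, W_sigma, W_router = (rng.normal(size=(dim, num_experts)) for _ in range(3))
weights, gate, mu, sigma = gaussian_moe_weights(q, num_frames, W_mu, W_sigma, W_router)

frame_feats = rng.normal(size=(num_frames, dim))
pooled = weights @ frame_feats  # question-weighted temporal pooling, shape [dim]
```

Unlike Top-K selection, which keeps or discards whole frames, the mixture assigns every frame a smooth, nonnegative weight, and the weights sum to one across frames.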