🤖 AI Summary
In multimodal question answering, modality inconsistencies (e.g., off-screen actions or voice-over narration) mislead fusion models when estimating cross-modal relevance, severely degrading localization accuracy. To address this, we propose RAVEN, whose core is QuART, a query-conditioned cross-modal gating module that enables query-guided, token-level alignment. We further design a three-stage progressive training paradigm: unimodal pretraining → query-aligned fusion → disagreement-aware fine-tuning. Additionally, we introduce AVS-QA, the first large-scale audio-visual-sensor synchronized QA benchmark. Our method combines learnable scalar gating scores, stage-wise contrastive learning, and adversarially robust training to align heterogeneous multimodal embeddings. Across seven benchmarks, our approach achieves accuracy gains of up to 14.5%; integrating sensor modalities yields an additional 16.4% improvement; and under modality corruption it surpasses state-of-the-art robustness baselines by 50.23%.
📝 Abstract
Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns a scalar relevance score to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline of unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning, with each stage targeting a distinct challenge in multimodal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized audio-video-sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multimodal QA benchmarks, including egocentric and exocentric tasks, show that RAVEN achieves accuracy gains of up to 14.5% and 8.0%, respectively, over state-of-the-art multimodal large language models. Incorporating sensor data provides an additional 16.4% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23%. Our code and dataset are available at https://github.com/BASHLab/RAVEN.
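To make the core idea concrete, here is a minimal sketch (our own simplification, not the released RAVEN/QuART code) of query-conditioned scalar gating: each modality token receives a relevance score in [0, 1] from its similarity to a pooled question embedding, and tokens are rescaled by that score before fusion. All module and parameter names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryGating(nn.Module):
    """Hypothetical sketch of query-conditioned token gating (not the paper's QuART).

    Each modality token gets a scalar relevance score from its similarity to a
    pooled query embedding; the score gates the token before cross-modal fusion.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.proj_q = nn.Linear(dim, dim)  # projects the pooled question embedding
        self.proj_t = nn.Linear(dim, dim)  # projects modality tokens
        self.scale = dim ** -0.5           # dot-product temperature

    def forward(self, query: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # query: (B, D) pooled question embedding; tokens: (B, N, D) modality tokens
        q = self.proj_q(query).unsqueeze(1)                    # (B, 1, D)
        t = self.proj_t(tokens)                                # (B, N, D)
        scores = torch.sigmoid((t * q).sum(-1) * self.scale)   # (B, N), each in (0, 1)
        return tokens * scores.unsqueeze(-1)                   # amplify/suppress tokens

# Usage: gate a batch of (e.g., concatenated audio/video) tokens by a question.
gate = QueryGating(dim=64)
av_tokens = torch.randn(2, 10, 64)
question = torch.randn(2, 64)
gated = gate(question, av_tokens)
```

Because the sigmoid score is strictly between 0 and 1, gating can only attenuate a token, never amplify its magnitude; distractor tokens are pushed toward zero while relevant ones pass through nearly unchanged.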