🤖 AI Summary
In multimodal question answering, modality inconsistencies (e.g., off-screen actions or voice-over narration) mislead fusion models when estimating cross-modal relevance, severely degrading localization accuracy. To address this, we propose RAVEN, whose core is QuART, a query-conditioned cross-modal gating module that enables query-guided, token-level alignment. We further design a three-stage progressive training paradigm: unimodal pretraining → query-aligned fusion → disagreement-aware fine-tuning. Additionally, we introduce AVS-QA, the first large-scale audio-visual-sensor synchronized QA benchmark. Our method combines learnable scalar gating scores, stage-wise contrastive learning, and adversarially robust training to align heterogeneous multimodal embeddings. Across seven benchmarks, our approach achieves accuracy gains of up to 14.5%; integrating sensor modalities yields an additional 16.4% improvement; and under modality corruption it surpasses state-of-the-art robustness baselines by 50.23%.
📝 Abstract
Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns a scalar relevance score to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline of unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning, with each stage targeting a distinct challenge in multimodal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized audio-video-sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multimodal QA benchmarks, including egocentric and exocentric tasks, show that RAVEN achieves accuracy gains of up to 14.5% and 8.0%, respectively, over state-of-the-art multimodal large language models. Incorporating sensor data provides an additional 16.4% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23%. Our code and dataset are available at https://github.com/BASHLab/RAVEN.
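To make the core idea concrete, here is a minimal sketch (our own simplification, not the released RAVEN/QuART code) of query-conditioned scalar gating: each modality token receives a relevance score in [0, 1] from its similarity to a pooled question embedding, and tokens are rescaled by that score before fusion. All module and parameter names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryGating(nn.Module):
    """Hypothetical sketch of query-conditioned token gating (not the paper's QuART).

    Each modality token gets a scalar relevance score from its similarity to a
    pooled query embedding; the score gates the token before cross-modal fusion.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.proj_q = nn.Linear(dim, dim)  # projects the pooled question embedding
        self.proj_t = nn.Linear(dim, dim)  # projects modality tokens
        self.scale = dim ** -0.5           # dot-product temperature

    def forward(self, query: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # query: (B, D) pooled question embedding; tokens: (B, N, D) modality tokens
        q = self.proj_q(query).unsqueeze(1)                    # (B, 1, D)
        t = self.proj_t(tokens)                                # (B, N, D)
        scores = torch.sigmoid((t * q).sum(-1) * self.scale)   # (B, N), each in (0, 1)
        return tokens * scores.unsqueeze(-1)                   # amplify/suppress tokens

# Usage: gate a batch of (e.g., concatenated audio/video) tokens by a question.
gate = QueryGating(dim=64)
av_tokens = torch.randn(2, 10, 64)
question = torch.randn(2, 64)
gated = gate(question, av_tokens)
```

Because the sigmoid score is strictly between 0 and 1, gating can only attenuate a token, never amplify its magnitude; distractor tokens are pushed toward zero while relevant ones pass through nearly unchanged.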