🤖 AI Summary
Existing video question answering benchmarks are limited to single-clip queries, which falls short of the fine-grained audio-visual retrieval and complex reasoning needed across large-scale video collections. To address this, the authors propose AV-HaystacksQA, a novel task, together with the AVHaystacks benchmark, the first evaluation framework targeting multi-video retrieval, spatiotemporal grounding, and joint reasoning in realistic scenarios. They introduce MAGNET, a model-agnostic multi-agent collaborative framework integrating cross-video spatiotemporal localization, multimodal prompt orchestration, and step-sequence alignment evaluation, along with two new metrics: STEM (SpatioTemporal Exact Matching) and MTGS (Multi-Hop Temporal Generation Score). On AVHaystacks, MAGNET achieves up to 89% and 65% relative improvements over baselines on BLEU@4 and GPT evaluation scores, respectively, significantly enhancing large multimodal models' capability for complex cross-video reasoning.
📝 Abstract
Large multimodal models (LMMs) have shown remarkable progress in audio-visual understanding, yet they struggle with real-world scenarios that require complex reasoning across extensive video collections. Existing benchmarks for video question answering remain limited in scope, typically involving one clip per query, which falls short of representing the challenges of large-scale audio-visual retrieval and reasoning encountered in practical applications. To bridge this gap, we introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. To this end, we present AVHaystacks, an audio-visual benchmark comprising 3,100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. Additionally, we propose a model-agnostic, multi-agent framework, MAGNET, to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores, respectively, on the QA task of our proposed AVHaystacks. To enable robust evaluation of multi-video retrieval and temporal grounding for optimal response generation, we introduce two new metrics: STEM, which captures alignment errors between a ground-truth and a predicted step sequence, and MTGS, which facilitates balanced and interpretable evaluation of segment-level grounding performance. Project: https://schowdhury671.github.io/magnet_project/
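The abstract describes STEM only as capturing alignment errors between a ground-truth and a predicted step sequence, without giving its exact formulation. As a rough illustration of what such a metric can look like, the sketch below scores a predicted step sequence against the ground truth with a normalized edit distance; the function name and normalization are hypothetical, not the paper's definition:

```python
def step_alignment_error(gt_steps, pred_steps):
    """Illustrative only: normalized edit distance between a ground-truth
    and a predicted step sequence (0.0 = exact match; larger = more
    insertions/deletions/substitutions needed to align them)."""
    m, n = len(gt_steps), len(pred_steps)
    # dp[i][j] = min edits turning gt_steps[:i] into pred_steps[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all remaining ground-truth steps
    for j in range(n + 1):
        dp[0][j] = j  # insert all remaining predicted steps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if gt_steps[i - 1] == pred_steps[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    return dp[m][n] / max(m, 1)  # normalize by ground-truth length
```

A perfectly ordered prediction scores 0.0, while dropping or reordering steps raises the error, which conveys the intuition behind step-sequence alignment even though the paper's STEM may weight errors differently.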