🤖 AI Summary
Existing video question answering benchmarks are limited to single-clip queries, which falls short of the fine-grained audio-visual retrieval and complex reasoning needed across large-scale video collections. To address this, the authors propose AV-HaystacksQA, a novel task, together with the AVHaystacks benchmark, the first evaluation framework targeting multi-video retrieval, spatiotemporal grounding, and joint reasoning in realistic scenarios. They introduce MAGNET, a model-agnostic multi-agent collaborative framework integrating cross-video spatiotemporal localization, multimodal prompt orchestration, and step-sequence alignment evaluation, along with two new metrics: STEM (SpatioTemporal Exact Matching) and MTGS (Multi-Hop Temporal Generation Score). On AVHaystacks, MAGNET achieves up to 89% and 65% relative improvements over baselines on BLEU@4 and GPT evaluation scores, respectively, significantly enhancing large multimodal models' capability for complex cross-video reasoning.
📝 Abstract
Large multimodal models (LMMs) have shown remarkable progress in audio-visual understanding, yet they struggle with real-world scenarios that require complex reasoning across extensive video collections. Existing benchmarks for video question answering remain limited in scope, typically involving one clip per query, which falls short of representing the challenges of large-scale audio-visual retrieval and reasoning encountered in practical applications. To bridge this gap, we introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. To this end, we present AVHaystacks, an audio-visual benchmark comprising 3,100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. Additionally, we propose a model-agnostic, multi-agent framework, MAGNET, to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores, respectively, on the QA task of our proposed AVHaystacks. To enable robust evaluation of multi-video retrieval and temporal grounding for optimal response generation, we introduce two new metrics: STEM, which captures alignment errors between a ground-truth and a predicted step sequence, and MTGS, which facilitates balanced and interpretable evaluation of segment-level grounding performance. Project: https://schowdhury671.github.io/magnet_project/
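The abstract describes STEM only as capturing alignment errors between a ground-truth and a predicted step sequence, without giving its exact formulation. As a rough illustration of what such a metric can look like, the sketch below scores a predicted step sequence against the ground truth with a normalized edit distance; the function name and normalization are hypothetical, not the paper's definition:

```python
def step_alignment_error(gt_steps, pred_steps):
    """Illustrative only: normalized edit distance between a ground-truth
    and a predicted step sequence (0.0 = exact match; larger = more
    insertions/deletions/substitutions needed to align them)."""
    m, n = len(gt_steps), len(pred_steps)
    # dp[i][j] = min edits turning gt_steps[:i] into pred_steps[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all remaining ground-truth steps
    for j in range(n + 1):
        dp[0][j] = j  # insert all remaining predicted steps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if gt_steps[i - 1] == pred_steps[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    return dp[m][n] / max(m, 1)  # normalize by ground-truth length
```

A perfectly ordered prediction scores 0.0, while dropping or reordering steps raises the error, which conveys the intuition behind step-sequence alignment even though the paper's STEM may weight errors differently.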