TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

📅 2026-05-08
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing audio-visual understanding benchmarks rarely test models' cross-modal, multi-hop trajectory reasoning in long-form videos and largely overlook robustness to multimodal hallucination. To address this gap, this work introduces TraceAV-Bench, the first benchmark to jointly evaluate both capabilities on long audio-visual content. It comprises 578 videos (339.5 hours in total) and 2,200 multiple-choice questions, each grounded in a reasoning chain that averages 3.68 hops across a 15.1-minute temporal span. Built with a three-step semi-automated pipeline followed by quality assurance, the benchmark covers 4 evaluation dimensions and 15 sub-tasks. Experiments show that robustness to multimodal hallucination is largely decoupled from general reasoning performance. Even the strongest closed-source OmniLLM (Gemini 3.1 Pro) reaches only 68.29% on general tasks, and the best open-source model (Ming-Flash-Omni-2.0) reaches 51.70%, leaving substantial room for improvement.
📝 Abstract
Real-world audio-visual understanding requires chaining evidence that is sparse, temporally dispersed, and split across the visual and auditory streams, whereas existing benchmarks largely fail to evaluate this capability. They restrict videos to short clips, isolate modalities, or reduce questions to one-hop perception. We introduce TraceAV-Bench, the first benchmark to jointly evaluate multi-hop reasoning over long audio-visual trajectories and multimodal hallucination robustness. TraceAV-Bench comprises 2,200 rigorously validated multiple-choice questions over 578 long videos, totaling 339.5 hours, spanning 4 evaluation dimensions and 15 sub-tasks. Each question is grounded in an explicit reasoning chain that averages 3.68 hops across a 15.1-minute temporal span. The dataset is built by a three-step semi-automated pipeline followed by a strict quality assurance process. Evaluation of multiple representative OmniLLMs on TraceAV-Bench reveals that the benchmark poses a persistent challenge across all models, with the strongest closed-source model (Gemini 3.1 Pro) reaching only 68.29% on general tasks, and the best open-source model (Ming-Flash-Omni-2.0) reaching 51.70%, leaving substantial headroom. Moreover, we find that robustness to multimodal hallucination is largely decoupled from general multimodal reasoning performance. We anticipate that TraceAV-Bench will stimulate further research toward OmniLLMs that can reason coherently and faithfully over long-form audio-visual content.
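As a rough illustration of how a multiple-choice benchmark of this kind is typically scored, the sketch below computes per-dimension accuracy and the mean number of reasoning hops. The `Question` record layout, the dimension names, and the toy data are all assumptions made for this example; the abstract does not describe the benchmark's actual data format or evaluation script.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Question:
    # Hypothetical record layout; the benchmark's real schema is not given here.
    dimension: str  # evaluation dimension the question belongs to
    answer: str     # gold option letter, e.g. "B"
    hops: int       # length of the annotated reasoning chain


def score(questions, predictions):
    """Return (per-dimension accuracy, mean hops per question)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q, pred in zip(questions, predictions):
        total[q.dimension] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        correct[q.dimension] += int(pred.strip().upper() == q.answer)
    accuracy = {d: correct[d] / total[d] for d in total}
    mean_hops = sum(q.hops for q in questions) / len(questions)
    return accuracy, mean_hops


# Toy data invented purely for illustration; not taken from the benchmark.
questions = [
    Question("general", "A", 3),
    Question("general", "C", 4),
    Question("hallucination", "B", 4),
    Question("hallucination", "D", 5),
]
predictions = ["A", "b", "B", "D"]

accuracy, mean_hops = score(questions, predictions)
print(accuracy, mean_hops)
```

Reporting accuracy separately per dimension, rather than as a single pooled number, is what makes it possible to observe the kind of decoupling between hallucination robustness and general reasoning that the abstract describes.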
Problem

Research questions and friction points this paper is trying to address.

multi-hop reasoning
audio-visual understanding
long-form video
multimodal hallucination
trajectory reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-hop reasoning
audio-visual benchmark
long-form video understanding
multimodal hallucination
OmniLLM evaluation