TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses multi-video event understanding, where models struggle to locate evidence segments scattered across long, heterogeneous videos and often overlook dense textual cues such as subtitles, broadcast graphics, and scoreboards. The authors propose TRACE, a ground-before-reasoning framework: visual content is first converted into a structured, text-searchable timeline via OCR and object detection; a text-only large language model then performs query-aware evidence localization over that timeline; and the retrieved frames and their grounding summaries guide a vision-language model to generate claims with consolidated cross-video citations. On the MAGMaR 2026 validation split, this raises macro-average MiRAGE F1 from 0.705 to 0.811 and citation recall from 0.440 to 0.628 relative to an unguided Qwen3-VL-30B baseline. The method also attains state-of-the-art results on the official MAGMaR 2026 leaderboard.

📝 Abstract

Multi-video event understanding demands models that can locate and attribute query-relevant evidence scattered across long, heterogeneous video corpora. Existing large vision-language models (LVLMs) often underperform in this regime because they quickly exhaust their context budget and struggle to precisely localize evidentially important segments, frequently missing dense informational cues such as broadcast graphics, subtitles, and scoreboards. We introduce TRACE, an evidence grounding-guided framework that follows a ground-before-reasoning strategy for multi-video event reasoning. Our approach first builds a structured, text-searchable timeline for each video using OCR and object detection. A text-only LLM then conducts query-aware evidence localization, selecting relevant moments prior to any downstream visual reasoning. The retrieved frames and their grounding summaries are subsequently used to steer LVLM-based claim generation and cross-video citation consolidation. Experiments on MAGMaR 2026 and WikiVideo demonstrate that structured grounding markedly boosts factual completeness and attribution fidelity. On the MAGMaR validation split, TRACE raises macro-average MiRAGE F1 from 0.705 to 0.811 compared to an unguided Qwen3-VL-30B baseline, with especially strong improvements in citation recall from 0.440 to 0.628. The method also attains state-of-the-art results on the official MAGMaR 2026 leaderboard.
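
The ground-before-reasoning flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `TimelineEntry` fields, the `localize` and `build_prompt` helpers, and the sample data are all hypothetical, and a simple token-overlap score stands in for the text-only LLM that TRACE uses for query-aware evidence localization. The OCR and object-detection outputs are assumed to be already available as strings.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TimelineEntry:
    """One text-searchable timeline segment (hypothetical schema)."""
    video_id: str
    start_s: float
    end_s: float
    ocr_text: str             # assumed output of an OCR pass
    objects: tuple[str, ...]  # assumed labels from an object detector

    def as_text(self) -> str:
        span = f"{self.start_s:.0f}-{self.end_s:.0f}s"
        labels = ", ".join(self.objects)
        return f"{self.video_id} {span} | text: {self.ocr_text} | objects: {labels}"


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def localize(query: str, timeline: list[TimelineEntry], top_k: int = 2) -> list[TimelineEntry]:
    """Rank entries across all videos by token overlap with the query.

    Stand-in for the LLM-based localization step: only the timeline text
    is consulted, never the pixels.
    """
    q = _tokens(query)
    scored = []
    for entry in timeline:
        overlap = len(q & _tokens(entry.ocr_text + " " + " ".join(entry.objects)))
        if overlap > 0:
            scored.append((overlap, entry))
    # Highest overlap first; ties broken by video id, then start time.
    scored.sort(key=lambda p: (-p[0], p[1].video_id, p[1].start_s))
    return [entry for _, entry in scored[:top_k]]


def build_prompt(query: str, evidence: list[TimelineEntry]) -> str:
    """Assemble a citation-ready prompt for a downstream claim generator."""
    lines = [f"Query: {query}", "Evidence:"]
    lines += [f"[{i}] {e.as_text()}" for i, e in enumerate(evidence, start=1)]
    lines.append("Write claims and cite evidence by index.")
    return "\n".join(lines)


# Invented sample timeline spanning two videos.
timeline = [
    TimelineEntry("vidA", 10, 20, "FINAL SCORE 3 - 1", ("scoreboard",)),
    TimelineEntry("vidA", 40, 50, "Weather update", ("person",)),
    TimelineEntry("vidB", 5, 15, "Match highlights", ("ball", "player")),
]

query = "final score of the match"
evidence = localize(query, timeline)
prompt = build_prompt(query, evidence)
print(prompt)
```

The point of the design is ordering: evidence selection runs over cheap text before any visual model is invoked, so only the selected segments (and their frames, in the real system) need to enter the vision-language model's context.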

Problem

Research questions and friction points this paper is trying to address.

multi-video event understanding

evidence grounding

vision-language models

information localization

factual attribution

Innovation

Methods, ideas, or system contributions that make the work stand out.

evidence grounding

ground-before-reasoning

multi-video event understanding