🤖 AI Summary
Current video large language models (VideoLLMs) lack benchmarks that rigorously evaluate their long-horizon complex reasoning capabilities, particularly long-context recall and precise temporal localization. To address this gap, we propose "Needle in a Montage" (NeMo), a novel adaptation of the "needle-in-a-haystack" paradigm to video-language understanding. NeMo introduces a scalable, automated data synthesis framework that generates high-quality, temporally grounded question-answer (QA) pairs for videos of varying durations. Leveraging this framework, we construct and publicly release NeMoBench, a large-scale video QA benchmark comprising 31,378 QA instances. A comprehensive evaluation of 20 state-of-the-art VideoLLMs reveals substantial limitations in temporal reasoning, especially over extended contexts, highlighting critical bottlenecks in current architectures. NeMoBench establishes a reproducible, extensible, and continuously updatable evaluation platform to advance research in video-language reasoning.
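The summary above describes the synthesis framework only at a high level. As a purely illustrative sketch (every name and field below is hypothetical, not taken from the NeMo codebase), a temporally grounded QA instance for a needle-in-a-montage setup might pair a short "needle" clip, spliced into a longer host video, with a question answerable only from that clip and the clip's ground-truth time span:

```python
from dataclasses import dataclass

@dataclass
class NeedleQA:
    """Hypothetical record for one temporally grounded QA instance."""
    host_video: str      # path/ID of the long "montage" video
    needle_start: float  # needle onset within the montage, in seconds
    needle_end: float    # needle offset within the montage, in seconds
    question: str        # question answerable only from the needle clip
    answer: str          # ground-truth answer

def insert_needle(host_duration: float, needle_duration: float,
                  offset: float) -> tuple[float, float]:
    """Return the needle's time span when spliced at `offset` seconds.

    Assumes 0 <= offset <= host_duration; the resulting montage is
    host_duration + needle_duration seconds long.
    """
    if not 0.0 <= offset <= host_duration:
        raise ValueError("offset must lie within the host video")
    return offset, offset + needle_duration

# Example: a 10 s needle spliced 300 s into a one-hour host video
start, end = insert_needle(3600.0, 10.0, 300.0)
qa = NeedleQA("montage_0001.mp4", start, end,
              question="When does the chef flip the pancake?",
              answer="Between 300 and 310 seconds.")
```

Recording the needle's exact span is what would make automatic grading of temporal localization possible: a model's predicted timestamps can be compared directly against the stored interval.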
📝 Abstract
Recent advances in video large language models (VideoLLMs) call for new evaluation protocols and benchmarks for complex temporal reasoning in video-language understanding. Inspired by the needle-in-a-haystack test widely used to evaluate LLMs, we introduce a novel task, Needle in a Montage (NeMo), designed to assess VideoLLMs' critical reasoning capabilities, including long-context recall and temporal grounding. To generate video question answering data for our task, we develop a scalable, automated pipeline for high-quality data synthesis. Built upon this pipeline, we present NeMoBench, a video-language benchmark centered on our task. Specifically, the full NeMoBench set comprises 31,378 automatically generated question-answer (QA) pairs from 13,486 videos with durations ranging from seconds to hours. Experiments demonstrate that our pipeline can reliably and automatically generate high-quality evaluation data, enabling NeMoBench to be continuously updated with the latest videos. We evaluate 20 state-of-the-art models on our benchmark, providing extensive results and key insights into their capabilities and limitations. Our project page is available at: https://lavi-lab.github.io/NeMoBench.
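The abstract does not specify how temporal grounding is scored. Purely as an assumption for illustration, a common choice in the temporal grounding literature is intersection-over-union (IoU) between the predicted and ground-truth time spans; a minimal sketch of that metric:

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU of two (start, end) time spans in seconds.

    A standard temporal-grounding metric; whether NeMoBench uses it
    is an assumption, not stated in the abstract.
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# Example: model predicts [295, 312] s for a ground-truth needle at [300, 310] s
print(round(temporal_iou((295.0, 312.0), (300.0, 310.0)), 3))  # 0.588
```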