🤖 AI Summary
Current video large language models (VideoLLMs) lack benchmarks that rigorously evaluate their long-horizon complex reasoning capabilities, particularly long-context recall and precise temporal localization. To address this gap, we propose "Needle in a Montage" (NeMo), a novel adaptation of the "needle-in-a-haystack" paradigm to video-language understanding. NeMo introduces a scalable, automated data synthesis framework that generates high-quality, temporally grounded question-answer (QA) pairs for videos of varying durations. Leveraging this framework, we construct and publicly release NeMoBench, a large-scale video QA benchmark comprising 31,378 QA instances. A comprehensive evaluation of 20 state-of-the-art VideoLLMs reveals substantial limitations in temporal reasoning, especially over extended contexts, highlighting critical bottlenecks in current architectures. NeMoBench establishes a reproducible, extensible, and continuously updatable evaluation platform to advance research in video-language reasoning.
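The summary above describes the synthesis framework only at a high level. As a purely illustrative sketch (every name and field below is hypothetical, not taken from the NeMo codebase), a temporally grounded QA instance for a needle-in-a-montage setup might pair a short "needle" clip, spliced into a longer host video, with a question answerable only from that clip and the clip's ground-truth time span:

```python
from dataclasses import dataclass

@dataclass
class NeedleQA:
    """Hypothetical record for one temporally grounded QA instance."""
    host_video: str      # path/ID of the long "montage" video
    needle_start: float  # needle onset within the montage, in seconds
    needle_end: float    # needle offset within the montage, in seconds
    question: str        # question answerable only from the needle clip
    answer: str          # ground-truth answer

def insert_needle(host_duration: float, needle_duration: float,
                  offset: float) -> tuple[float, float]:
    """Return the needle's time span when spliced at `offset` seconds.

    Assumes 0 <= offset <= host_duration; the resulting montage is
    host_duration + needle_duration seconds long.
    """
    if not 0.0 <= offset <= host_duration:
        raise ValueError("offset must lie within the host video")
    return offset, offset + needle_duration

# Example: a 10 s needle spliced 300 s into a one-hour host video
start, end = insert_needle(3600.0, 10.0, 300.0)
qa = NeedleQA("montage_0001.mp4", start, end,
              question="When does the chef flip the pancake?",
              answer="Between 300 and 310 seconds.")
```

Recording the needle's exact span is what would make automatic grading of temporal localization possible: a model's predicted timestamps can be compared directly against the stored interval.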
📝 Abstract
Recent advances in video large language models (VideoLLMs) call for new evaluation protocols and benchmarks for complex temporal reasoning in video-language understanding. Inspired by the needle-in-a-haystack test widely used to evaluate LLMs, we introduce a novel task, Needle in a Montage (NeMo), designed to assess VideoLLMs' critical reasoning capabilities, including long-context recall and temporal grounding. To generate video question answering data for our task, we develop a scalable, automated pipeline for high-quality data synthesis. Built upon this pipeline, we present NeMoBench, a video-language benchmark centered on our task. Specifically, the full NeMoBench set comprises 31,378 automatically generated question-answer (QA) pairs from 13,486 videos with durations ranging from seconds to hours. Experiments demonstrate that our pipeline can reliably and automatically generate high-quality evaluation data, enabling NeMoBench to be continuously updated with the latest videos. We evaluate 20 state-of-the-art models on our benchmark, providing extensive results and key insights into their capabilities and limitations. Our project page is available at: https://lavi-lab.github.io/NeMoBench.
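The abstract does not specify how temporal grounding is scored. Purely as an assumption for illustration, a common choice in the temporal grounding literature is intersection-over-union (IoU) between the predicted and ground-truth time spans; a minimal sketch of that metric:

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU of two (start, end) time spans in seconds.

    A standard temporal-grounding metric; whether NeMoBench uses it
    is an assumption, not stated in the abstract.
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# Example: model predicts [295, 312] s for a ground-truth needle at [300, 310] s
print(round(temporal_iou((295.0, 312.0), (300.0, 310.0)), 3))  # 0.588
```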