Neptune: The Long Orbit to Benchmarking Long Video Understanding

📅 2024-12-12
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Current long-video understanding models exhibit limited performance on complex tasks such as temporal reasoning, counting, and state-change tracking, especially on videos up to 15 minutes long, and the field lacks open, reproducible evaluation benchmarks and tooling. To address this, the authors introduce Neptune, a multimodal long-video understanding benchmark targeting videos up to 15 minutes, centered on challenges such as temporal ordering, state changes, and cross-modal reasoning. A scalable, large-model-driven annotation pipeline automatically generates dense, time-aligned captions and question-answer pairs with challenging decoys. The authors also propose GEM, a lightweight, open-source evaluation model enabling fine-grained assessment of open-ended generative answers. Experiments show that leading open-source long-video models consistently underperform on this benchmark, confirming its diagnostic utility and its potential to catalyze progress in long-video understanding research.

📝 Abstract
We introduce Neptune, a benchmark for long video understanding that requires reasoning over long time horizons and across different modalities. Many existing video datasets and models are focused on short clips (10s-30s). While some long video datasets do exist, they can often be solved by powerful image models applied per frame (and often to very few frames) in a video, and are usually manually annotated at high cost. In order to mitigate both these problems, we propose a scalable dataset creation pipeline which leverages large models (VLMs and LLMs), to automatically generate dense, time-aligned video captions, as well as tough question answer decoy sets for video segments (up to 15 minutes in length). Our dataset Neptune covers a broad range of long video reasoning abilities and consists of a subset that emphasizes multimodal reasoning. Since existing metrics for open-ended question answering are either rule-based or may rely on proprietary models, we provide a new open source model-based metric GEM to score open-ended responses on Neptune. Benchmark evaluations reveal that most current open-source long video models perform poorly on Neptune, particularly on questions testing temporal ordering, counting and state changes. Through Neptune, we aim to spur the development of more advanced models capable of understanding long videos. The dataset is available at https://github.com/google-deepmind/neptune
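As a rough illustration of how a model-based metric for open-ended answers can be applied, the sketch below grades candidate answers against references with an LLM judge and averages the scores. The prompt format, 0-to-1 scale, and stub judge are illustrative assumptions, not the actual GEM implementation.

```python
import re


def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Format a single grading request for a judge model (hypothetical prompt)."""
    return (
        "Rate how well the candidate answer matches the reference.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with a score between 0.0 and 1.0."
    )


def parse_score(judge_output: str) -> float:
    """Extract the first number from the judge's reply, clamped to [0, 1]."""
    match = re.search(r"\d*\.?\d+", judge_output)
    if match is None:
        return 0.0
    return max(0.0, min(1.0, float(match.group())))


def score_answers(examples, judge) -> float:
    """Average judge score over (question, reference, candidate) triples."""
    scores = [parse_score(judge(build_judge_prompt(*ex))) for ex in examples]
    return sum(scores) / len(scores) if scores else 0.0
```

In practice the `judge` callable would wrap an open-source scoring model such as GEM; here any function mapping a prompt string to a reply string can be plugged in for testing.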
Problem

Research questions and friction points this paper is trying to address.

Long Video Understanding
Complex Reasoning
Evaluation Tool
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neptune
GEM
Long Video Understanding