🤖 AI Summary
Existing video understanding benchmarks feature videos that are too short to rigorously evaluate the long-sequence modeling capabilities of multimodal large language models (MLLMs). Method: We introduce ALLVB (ALL-in-One Long Video Understanding Benchmark), a holistic, task-unified benchmark for long-video understanding comprising 1,376 videos (averaging nearly 2 hours each) across 16 categories and 252k video question-answer (QA) pairs. We propose a unified QA format spanning 9 major video understanding tasks and develop a GPT-4o-driven, fully automated annotation pipeline that combines long-video semantic segmentation with human quality verification. Contribution/Results: To our knowledge, ALLVB is the largest long-video understanding benchmark in terms of number of videos, average duration, and number of QAs. Extensive experiments reveal substantial performance degradation of state-of-the-art MLLMs on this benchmark, confirming its difficulty and establishing it as a comprehensive, maintainable, and extensible standard for evaluating long-video comprehension.
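To make the task-unified QA format concrete, here is a minimal sketch of what a single benchmark item might look like. The field names (`video_id`, `task`, `options`, and so on) are illustrative assumptions, not the paper's released schema:

```python
from dataclasses import dataclass, field

# Hypothetical schema for one ALLVB item: every task, from summarization to
# temporal reasoning, is expressed as the same multiple-choice video QA record.
@dataclass
class VideoQAItem:
    video_id: str          # identifier of the source long video
    category: str          # one of the 16 video categories
    task: str              # which of the 9 understanding tasks this QA probes
    question: str          # natural-language question about the video
    options: list[str] = field(default_factory=list)  # candidate answers
    answer: str = ""       # ground-truth option label, e.g. "B"

# A model is then evaluated the same way regardless of task:
item = VideoQAItem(
    video_id="movie_0001",
    category="Film",
    task="plot_summarization",
    question="Which event directly causes the protagonist to leave the city?",
    options=["A. A job offer", "B. A family emergency",
             "C. A lost bet", "D. A storm warning"],
    answer="B",
)
```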
📝 Abstract
From image to video understanding, the capabilities of Multi-modal LLMs (MLLMs) have grown increasingly powerful. However, most existing video understanding benchmarks feature relatively short videos, making them inadequate for evaluating the long-sequence modeling capabilities of MLLMs. This highlights the urgent need for a comprehensive, integrated long video understanding benchmark to thoroughly assess the abilities of MLLMs. To this end, we propose ALLVB (ALL-in-One Long Video Understanding Benchmark). ALLVB's main contributions include: 1) It integrates 9 major video understanding tasks. These tasks are converted into video QA formats, allowing a single benchmark to evaluate 9 different video understanding capabilities of MLLMs, highlighting the versatility, comprehensiveness, and challenging nature of ALLVB. 2) A fully automated annotation pipeline using GPT-4o is designed, requiring only human quality control, which facilitates the maintenance and expansion of the benchmark. 3) It contains 1,376 videos across 16 categories, averaging nearly 2 hours each, with a total of 252k QAs. To the best of our knowledge, it is the largest long video understanding benchmark in terms of the number of videos, average duration, and number of QAs. We have tested various mainstream MLLMs on ALLVB, and the results indicate that even the most advanced commercial models have significant room for improvement. This reflects the benchmark's challenging nature and demonstrates the substantial potential for development in long video understanding.
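As a rough illustration of how a GPT-4o-based annotation pipeline with a human quality-control step could be wired up, here is a minimal sketch; the prompt wording, function names, and segment-captioning input are assumptions made for this example, not the authors' released code:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_qa(segment_description: str, task: str) -> str:
    """Ask GPT-4o to draft a multiple-choice QA for one video segment.

    `segment_description` would come from an earlier stage that splits the
    long video into semantically coherent segments and captions them.
    """
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You write multiple-choice video QA items."},
            {"role": "user",
             "content": (
                 f"Task type: {task}\n"
                 f"Segment description: {segment_description}\n"
                 "Write one question with four options (A-D) and mark "
                 "the correct answer."
             )},
        ],
    )
    return response.choices[0].message.content

def human_quality_check(qa_text: str) -> bool:
    """Placeholder for the manual review step: an annotator accepts or
    rejects each generated QA before it enters the benchmark."""
    print(qa_text)
    return input("Keep this QA? [y/n] ").strip().lower() == "y"

if __name__ == "__main__":
    draft = generate_qa(
        "The chef plates the dessert while the judges discuss the scoring.",
        task="event_understanding",
    )
    if human_quality_check(draft):
        print("QA accepted into the benchmark.")
```

The design point this sketch captures is that automation handles the expensive drafting work, while humans are kept only in the cheap accept/reject loop, which is what makes the benchmark easy to maintain and expand.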