🤖 AI Summary
This study addresses the challenge of enumerating and counting sparse, diverse events in ultra-long videos, a task where existing methods show significant limitations in long-range temporal reasoning and interpretability. The work introduces EC-Bench, the first unified evaluation benchmark designed specifically for ultra-long videos; it jointly assesses event enumeration, quantitative counting, and temporal evidence localization, supported by explicit temporal grounding annotations. A systematic evaluation of 22 multimodal large language models on 152 videos (each exceeding 30 minutes) and 1,699 annotated queries shows that even the best-performing model achieves only 29.98% accuracy on enumeration and 23.74% on counting, far below human performance (78.57% and 82.97%, respectively). This gap highlights a fundamental bottleneck in current models' ability to perform quantitative visual reasoning over extended temporal spans.
📝 Abstract
Counting in long videos remains a fundamental yet underexplored challenge in computer vision. Real-world recordings often span tens of minutes or longer and contain sparse, diverse events, making long-range temporal reasoning particularly difficult. However, most existing video counting benchmarks focus on short clips and evaluate only the final numerical answer, providing little insight into what should be counted or whether models consistently identify relevant instances across time. We introduce EC-Bench, a benchmark that jointly evaluates enumeration, counting, and temporal evidence grounding in long-form videos. EC-Bench contains 152 videos longer than 30 minutes and 1,699 queries paired with explicit evidence spans. Across 22 multimodal large language models (MLLMs), the best model achieves only 29.98% accuracy on Enumeration and 23.74% on Counting, while human performance reaches 78.57% and 82.97%, respectively. Our analysis reveals strong relationships between enumeration accuracy, temporal grounding, and counting performance. These results highlight fundamental limitations of current MLLMs and establish EC-Bench as a challenging benchmark for long-form quantitative video reasoning.
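To make the reported metrics concrete, here is a minimal sketch of how counting accuracy and temporal evidence overlap might be scored. The function names, exact-match criterion, and IoU formulation are illustrative assumptions, not EC-Bench's official evaluation protocol.

```python
# Hedged sketch of plausible scoring functions for a counting + temporal
# grounding benchmark. These are assumptions for illustration, not the
# paper's actual metric definitions.

def counting_accuracy(preds, golds):
    """Fraction of queries where the predicted count exactly matches
    the ground-truth count."""
    assert len(preds) == len(golds) and len(golds) > 0
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)

def temporal_iou(pred_span, gold_span):
    """Intersection-over-union of two (start, end) time spans in seconds,
    a common way to score temporal evidence localization."""
    s1, e1 = pred_span
    s2, e2 = gold_span
    intersection = max(0.0, min(e1, e2) - max(s1, s2))
    union = max(e1, e2) - min(s1, s2)
    return intersection / union if union > 0 else 0.0
```

For example, a model that gets 2 of 3 counts exactly right scores 66.7% under this exact-match convention, and a predicted span (0 s, 10 s) against a gold span (5 s, 15 s) yields an IoU of 1/3.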