AI Summary
Existing benchmarks lack rigorous evaluation of high-level temporal reasoning in instructional video understanding.
Method: We introduce InstructionBench, the first benchmark dedicated to step-wise instructional videos. It employs a vision-perception-guided filtering mechanism to exclude questions answerable by common sense alone; adopts temporal-aware video sampling and annotation protocols; and leverages GPT-4 for multimodal question-answer generation, supported by an automated synthetic framework yielding 5K high-quality QA pairs plus 19K synthetic QA instances.
Contribution/Results: We establish a specialized evaluation framework for instructional videos and systematically assess 12 open- and closed-source Video-LLMs. Our evaluation reveals that even the state-of-the-art GPT-4o achieves only 53.42% accuracy, exposing a critical temporal-reasoning bottleneck. InstructionBench is publicly released as a standardized, community-driven benchmark for advancing temporal reasoning in instructional video understanding.
Abstract
Despite progress in video large language models (Video-LLMs), research on instructional video understanding, which is crucial for enhancing access to instructional content, remains insufficient. To address this, we introduce InstructionBench, an Instructional video understanding Benchmark that challenges models' advanced temporal reasoning within instructional videos characterized by their strict step-by-step flow. Employing GPT-4, we formulate Q&A pairs in open-ended and multiple-choice formats to assess both Coarse-Grained event-level and Fine-Grained object-level reasoning. Our filtering strategies exclude questions answerable purely by common-sense knowledge, so that evaluating Video-LLMs focuses on visual perception and analysis. The final benchmark contains 5k questions across over 700 videos. We evaluate the latest Video-LLMs on InstructionBench, finding that closed-source models outperform open-source ones. However, even the best model, GPT-4o, achieves only 53.42% accuracy, indicating significant gaps in temporal reasoning. To further advance the field, we also develop a comprehensive instructional video dataset with over 19k Q&A pairs from nearly 2.5k videos, built with an automated data-generation framework, thereby enriching the community's research resources.
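The commonsense-filtering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ask_blind_model` is a hypothetical stand-in for a text-only GPT-4 query (no video frames), replaced here by a naive longest-option heuristic so the sketch runs self-contained; the idea is simply that any question a blind model already answers correctly is discarded.

```python
def ask_blind_model(question, choices):
    """Hypothetical stand-in for a text-only LLM queried WITHOUT the video.
    Here it naively picks the longest option, mimicking a language-prior
    shortcut; a real pipeline would send the text alone to GPT-4."""
    return max(choices, key=len)


def filter_commonsense_answerable(qa_pairs):
    """Keep only Q&A pairs the blind model answers incorrectly, so the
    surviving questions genuinely require perceiving the video."""
    return [
        qa for qa in qa_pairs
        if ask_blind_model(qa["question"], qa["choices"]) != qa["answer"]
    ]


if __name__ == "__main__":
    qa_pairs = [
        # The blind model's guess happens to be correct, i.e. the question
        # is answerable without watching the video -> filtered out.
        {"question": "What is done right after cracking the eggs?",
         "choices": ["pour water", "whisk the eggs thoroughly in a bowl"],
         "answer": "whisk the eggs thoroughly in a bowl"},
        # The blind model's guess is wrong -> the question survives.
        {"question": "Which utensil is picked up first?",
         "choices": ["a spoon", "the large silicone spatula"],
         "answer": "a spoon"},
    ]
    kept = filter_commonsense_answerable(qa_pairs)
    print(len(kept))  # 1 question survives the filter
```

In the actual benchmark pipeline, the blind answerer would be a strong text-only model and the check could be repeated over several sampled responses before discarding a question; the structure of the filter stays the same.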