🤖 AI Summary
Long-video understanding faces challenges in temporal modeling, strong contextual dependency, and coarse-grained event localization. Method: We introduce LongViTU, a large-scale instruction-tuning dataset for long videos comprising ~121K QA pairs drawn from ~900 hours of video. It features a hierarchical tree-structured video representation and a self-revision QA generation mechanism, enabling explicit timestamp annotation, long-term context (average certificate length of 4.6 minutes), and multi-dimensional reasoning (e.g., causal, planning, commonsense). LongViTU also serves as an instruction-following benchmark for long-form and streaming video understanding. Results: LongVU, supervised fine-tuned on LongViTU, achieves a +12.0% improvement on our benchmark, +2.2% on the in-distribution benchmark EgoSchema, and +1.0%, +2.2%, and +1.2% on the out-of-distribution benchmarks VideoMME (Long), WorldQA, and OpenEQA, respectively. GPT-4 evaluation scores of 49.9 (LongVU) and 52.3 (Gemini-1.5-Pro) confirm the benchmark's difficulty and LongViTU's strong generalization.
📝 Abstract
This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h of video), automatically generated dataset for long-form video understanding. We develop a systematic approach that organizes videos into a hierarchical tree structure and incorporates self-revision mechanisms to ensure high-quality QA pairs. Each QA pair in LongViTU features: 1) long-term context (average certificate length of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense, causality, planning, etc.); and 3) explicit timestamp labels for relevant events. LongViTU also serves as a benchmark for instruction following in long-form and streaming video understanding. We evaluate the open-source state-of-the-art long-video understanding model LongVU and the commercial model Gemini-1.5-Pro on our benchmark; they achieve GPT-4 scores of 49.9 and 52.3, respectively, underscoring the substantial challenge the benchmark poses. Further supervised fine-tuning (SFT) of LongVU on LongViTU yields performance improvements of 12.0% on our benchmark, 2.2% on the in-distribution (ID) benchmark EgoSchema, and 1.0%, 2.2%, and 1.2% on the out-of-distribution (OOD) benchmarks VideoMME (Long), WorldQA, and OpenEQA, respectively. These results demonstrate LongViTU's high data quality and robust OOD generalizability.