LongViTU: Instruction Tuning for Long-Form Video Understanding

📅 2025-01-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-form video understanding is hampered by weak temporal modeling, strong contextual dependencies, and coarse-grained event localization. Method: We introduce LongViTU, the first large-scale instruction-tuning dataset for long videos, comprising 121K QA pairs drawn from 900 hours of video. It features a hierarchical tree-structured video representation and a self-correcting QA generation mechanism, yielding explicit timestamp annotations, an average context length of 4.6 minutes, and multi-dimensional reasoning (e.g., causal, planning, commonsense). We also establish the first instruction-following evaluation benchmark tailored to streaming long videos. Results: LongVU, supervised fine-tuned on LongViTU, improves by 12.0% on the LongViTU benchmark, 2.2% on the in-distribution benchmark EgoSchema, and 1.0%, 2.2%, and 1.2% on the out-of-distribution benchmarks VideoMME (Long), WorldQA, and OpenEQA. GPT-4 evaluation scores of 49.9 (LongVU) and 52.3 (Gemini-1.5-Pro) confirm the benchmark's difficulty and LongViTU's strong generalization.

📝 Abstract
This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h of video), automatically generated dataset for long-form video understanding. We developed a systematic approach that organizes videos into a hierarchical tree structure and incorporates self-revision mechanisms to ensure high-quality QA pairs. Each QA pair in LongViTU features: 1) long-term context (average certificate length of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense, causality, planning, etc.); and 3) explicit timestamp labels for relevant events. LongViTU also serves as a benchmark for instruction following in long-form and streaming video understanding. We evaluate the open-source state-of-the-art long video understanding model, LongVU, and the commercial model, Gemini-1.5-Pro, on our benchmark. They achieve GPT-4 scores of 49.9 and 52.3, respectively, underscoring the substantial challenge posed by our benchmark. Further supervised fine-tuning (SFT) of LongVU on LongViTU led to performance improvements of 12.0% on our benchmark, 2.2% on the in-distribution (ID) benchmark EgoSchema, and 1.0%, 2.2%, and 1.2% on the out-of-distribution (OOD) benchmarks VideoMME (Long), WorldQA, and OpenEQA, respectively. These outcomes demonstrate LongViTU's high data quality and robust OOD generalizability.
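The abstract describes each LongViTU entry as a QA pair with a reasoning category, a long-term context (certificate length), and explicit timestamp labels for the relevant events. The dataset's actual schema is not given here; the following is a minimal hypothetical sketch of what such a record might look like (all field names are illustrative assumptions):

```python
# Hypothetical sketch of a LongViTU-style QA record.
# Field names are illustrative assumptions, not the dataset's actual schema.
from dataclasses import dataclass

@dataclass
class QAPair:
    video_id: str
    question: str
    answer: str
    category: str      # e.g. "causal", "planning", "commonsense"
    start_sec: float   # explicit timestamp: start of the relevant event
    end_sec: float     # explicit timestamp: end of the relevant event

    @property
    def context_minutes(self) -> float:
        """Certificate length: the span a model must watch to answer."""
        return (self.end_sec - self.start_sec) / 60.0

example = QAPair(
    video_id="ego_0001",
    question="Why did the person return to the kitchen?",
    answer="To retrieve the pan left on the stove.",
    category="causal",
    start_sec=120.0,
    end_sec=396.0,  # 276 s span, i.e. the reported 4.6-minute average
)
print(round(example.context_minutes, 1))  # prints 4.6
```

The `context_minutes` property mirrors the paper's "certificate length" notion: the duration of video evidence needed to answer the question, averaging 4.6 minutes across the dataset.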
Problem

Research questions and friction points this paper is trying to address.

Large-Scale Dataset Development
Long Video Analysis
Temporal Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

LongViTU
Video Understanding
Large-scale Dataset