🤖 AI Summary
Long-video understanding faces challenges in temporal modeling, strong contextual dependency, and coarse-grained event localization. Method: We introduce LongViTU, a large-scale instruction-tuning dataset for long videos comprising ~121K QA pairs drawn from ~900 hours of video. It features a hierarchical tree-structured video representation and a self-revision QA generation mechanism, enabling explicit timestamp annotation, long-term context (average certificate length of 4.6 minutes), and multi-dimensional reasoning (e.g., causal, planning, commonsense). LongViTU also serves as an instruction-following benchmark for long-form and streaming video understanding. Results: LongVU, supervised fine-tuned on LongViTU, achieves a +12.0% improvement on our benchmark, +2.2% on the in-distribution benchmark EgoSchema, and +1.0%, +2.2%, and +1.2% on the out-of-distribution benchmarks VideoMME (Long), WorldQA, and OpenEQA, respectively. GPT-4 evaluation scores of 49.9 (LongVU) and 52.3 (Gemini-1.5-Pro) confirm the benchmark's difficulty and LongViTU's strong generalization.
📝 Abstract
This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h of video), automatically generated dataset for long-form video understanding. We develop a systematic approach that organizes videos into a hierarchical tree structure and incorporates self-revision mechanisms to ensure high-quality QA pairs. Each QA pair in LongViTU features: 1) long-term context (average certificate length of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense, causality, planning, etc.); and 3) explicit timestamp labels for relevant events. LongViTU also serves as a benchmark for instruction following in long-form and streaming video understanding. We evaluate the open-source state-of-the-art long-video understanding model LongVU and the commercial model Gemini-1.5-Pro on our benchmark; they achieve GPT-4 scores of 49.9 and 52.3, respectively, underscoring the substantial challenge the benchmark poses. Further supervised fine-tuning (SFT) of LongVU on LongViTU yields performance improvements of 12.0% on our benchmark, 2.2% on the in-distribution (ID) benchmark EgoSchema, and 1.0%, 2.2%, and 1.2% on the out-of-distribution (OOD) benchmarks VideoMME (Long), WorldQA, and OpenEQA, respectively. These results demonstrate LongViTU's high data quality and robust OOD generalizability.