AI Summary
To address the scarcity of high-quality, human-annotated video data for training large video-language models, this paper introduces a large-scale synthetic data generation paradigm tailored for video instruction following. We construct LLaVA-Video-178K, a high-fidelity synthetic video instruction dataset comprising fine-grained captioning, open-ended question answering, and multiple-choice QA, and propose a unified cross-task format with mixed-data training. Our approach leverages multimodal large models for automated data synthesis and employs end-to-end video-language joint instruction tuning to align visual understanding with linguistic instructions. The resulting model, LLaVA-Video, achieves strong performance across multiple video understanding benchmarks, demonstrating that synthetically generated data can serve as an effective substitute for human annotations. To foster reproducibility and community advancement, the dataset, data generation pipeline, and model checkpoints are released.
Abstract
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.