Video Instruction Tuning With Synthetic Data

πŸ“… 2024-10-03
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 248
✨ Influential: 64
πŸ“„ PDF
πŸ€– AI Summary
To address the scarcity of high-quality, human-annotated video data for training large video-language models, this paper introduces the first large-scale synthetic data generation paradigm tailored for video instruction following. We construct LLaVA-Video-178Kβ€”a high-fidelity synthetic video instruction dataset comprising fine-grained captioning, open-ended question answering, and multiple-choice QAβ€”and propose a unified cross-task format with mixed-data training. Our approach leverages multimodal large models for automated data synthesis and employs end-to-end video-language joint instruction tuning to align visual understanding with linguistic instructions. The resulting model, LLaVA-Video, achieves state-of-the-art performance across multiple video understanding benchmarks, demonstrating that synthetically generated data can match the efficacy of real human annotations. To foster reproducibility and community advancement, we fully open-source the dataset, data generation pipeline, and model checkpoints.

Technology Category

Application Category

πŸ“ Abstract
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
Problem

Research questions and friction points this paper is trying to address.

Addresses lack of high-quality video data for multimodal models
Creates synthetic dataset for video instruction-following tasks
Develops video LMM with strong benchmark performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic dataset for video instruction-following
Combines existing visual instruction tuning data
Achieves strong performance across video benchmarks
πŸ”Ž Similar Papers
No similar papers found.