AI Summary
To address the scarcity of high-quality, human-annotated video data for training large video-language models, this paper introduces a large-scale synthetic data generation paradigm tailored for video instruction following. We construct LLaVA-Video-178K, a high-fidelity synthetic video instruction dataset comprising fine-grained captioning, open-ended question answering, and multiple-choice QA, and propose a unified cross-task format with mixed-data training. Our approach leverages multimodal large models for automated data synthesis and employs end-to-end video-language joint instruction tuning to align visual understanding with linguistic instructions. The resulting model, LLaVA-Video, achieves strong performance across multiple video understanding benchmarks, demonstrating that synthetically generated data can serve as an effective substitute for human annotations. To foster reproducibility and community advancement, the dataset, data generation pipeline, and model checkpoints are released.
Abstract
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.