All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding

📅 2026-04-14

🤖 AI Summary
Real-world multimodal video data is costly to acquire and limited in diversity, which hinders the training of large-scale multitask video understanding models. To address this, the work proposes a unified synthetic data generation framework that automatically produces unlimited, multitask-compatible multimodal video data. It also introduces a visual question answering (VQA)-based fine-tuning strategy that trains models on structured question-answer pairs rather than relying solely on captions or simple instructions, with the aim of strengthening visual grounding and reasoning. Models trained predominantly on this synthetic data generalize effectively to real-world datasets and often outperform traditionally trained counterparts on three tasks: video object counting, video question answering, and video object segmentation.

📝 Abstract
Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, and segmentation. However, collecting and annotating multimodal video data in the real world is costly, slow, and inherently limited in diversity and coverage. To address this challenge, we propose a unified synthetic data generation pipeline capable of automatically producing unlimited multimodal video data with rich and diverse supervision. Our framework supports multiple task formats within a single pipeline, enabling scalable and consistent data creation across tasks. To further enhance reasoning ability, we introduce a VQA-based fine-tuning strategy that trains models to answer structured questions about visual content rather than relying solely on captions or simple instructions. This formulation encourages deeper visual grounding and reasoning. We evaluate our approach on three challenging tasks: video object counting, video-based visual question answering, and video object segmentation. Experimental results demonstrate that models trained predominantly on synthetic data generalize effectively to real-world datasets, often outperforming traditionally trained counterparts. Our findings highlight the potential of unified synthetic data pipelines as a scalable alternative to expensive real-world annotation for multimodal video understanding.
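To make the idea of structured question-answer supervision concrete, the following is a minimal sketch of how one synthetic scene record might be turned into QA pairs for the three task formats named in the abstract. It is not the authors' pipeline: the `SyntheticObject` and `QAPair` types, the `build_qa_pairs` function, the question templates, and the use of bounding boxes as a stand-in for segmentation masks are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class SyntheticObject:
    """One object in a synthetic video (hypothetical schema)."""
    object_id: int
    category: str
    color: str
    boxes: Dict[int, Box]  # frame index -> box, only for frames where visible


@dataclass
class QAPair:
    """A structured question-answer pair tagged with its task format."""
    task: str
    question: str
    answer: str


def build_qa_pairs(objects: List[SyntheticObject], category: str) -> List[QAPair]:
    """Emit counting, VQA, and localization QA pairs for one category.

    Because the scene is synthetic, every answer is read directly from the
    generator's own annotations; no human labeling step is involved.
    """
    matches = [o for o in objects if o.category == category]

    # Counting: one question per category, answered by the number of instances.
    pairs = [
        QAPair(
            task="counting",
            question=f"How many distinct {category} objects appear in the video?",
            answer=str(len(matches)),
        )
    ]

    for obj in matches:
        # Attribute VQA: the answer is a known property of the object.
        pairs.append(
            QAPair(
                task="vqa",
                question=f"What color is {category} #{obj.object_id}?",
                answer=obj.color,
            )
        )
        if not obj.boxes:
            continue  # object never visible, so no localization question
        # Localization: boxes stand in for masks in this sketch.
        first_frame = min(obj.boxes)
        pairs.append(
            QAPair(
                task="segmentation",
                question=f"Where is {category} #{obj.object_id} in frame {first_frame}?",
                answer=str(list(obj.boxes[first_frame])),
            )
        )
    return pairs
```

Calling `build_qa_pairs(objects, "car")` on a scene with two cars would yield one counting pair followed by an attribute pair and a localization pair per car. Keeping all task formats behind a single function mirrors the abstract's point that one pipeline can supply consistent supervision across tasks.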
Problem

Research questions and friction points this paper is trying to address.

multimodal video understanding
synthetic data
data annotation
video understanding
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic data generation
multimodal video understanding
unified pipeline
visual question answering
visual grounding