VDCook: DIY video data cook your MLLMs

📅 2026-03-04
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the high barrier, static nature, and limited evolvability of specialized video training datasets by proposing a configurable, self-evolving video data operating system. The system enables users to issue requests via natural language and parameters, automatically optimizing queries and executing parallel retrieval of real videos alongside controllable synthesis to produce domain-specific data packages enriched with full provenance, multidimensional metadata, and reproducible notebooks. Built upon the Model Context Protocol (MCP), it establishes a dynamic data ecosystem that supports community contributions, governance-driven continuous updates, and flexible “cooking” mechanisms. Experiments demonstrate that this approach substantially reduces dataset construction costs and enhances the training efficiency and iterative capability of multimodal large models in vertical domains.

📝 Abstract
We introduce VDCook, a self-evolving video data operating system: a configurable video data construction platform for researchers and vertical-domain teams. Users initiate data requests via natural language queries and adjustable parameters (scale, retrieval-synthesis ratio, quality threshold). The system automatically optimizes the query and runs the real-video retrieval and controlled synthesis modules concurrently. It ultimately generates in-domain data packages with complete provenance and metadata, along with reproducible notebooks. Unlike traditional static datasets that are built once, VDCook supports continuous updates and domain expansion through an automated data ingestion mechanism based on MCP (Model Context Protocol)\cite{mcp2024anthropic}, turning datasets into dynamically evolving open ecosystems. The system also provides multi-dimensional metadata annotation (scene segmentation, motion scoring, OCR ratio, automatic captioning, etc.), laying the foundation for flexible downstream data "cooking" and indexing\cite{vlogger}. By working at the infrastructure level, the platform aims to significantly lower the barrier to constructing specialized video training datasets, while supporting community contributions and a governance-enabled data expansion paradigm. Project demo: https://screenapp.io/app/v/WP0SvffgsH
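The abstract names three adjustable request parameters (scale, retrieval-synthesis ratio, quality threshold) but does not show an interface. The sketch below is one possible way such a request could be represented and how the scale might be split between the retrieval and synthesis modules. The class `DataRequest`, its field names, and the function `split_budget` are hypothetical and chosen for illustration; they are not VDCook's actual API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DataRequest:
    """Hypothetical request object; names are illustrative, not VDCook's API."""

    query: str                # natural language description of the target domain
    scale: int                # total number of clips requested
    synthesis_ratio: float    # assumed: fraction of clips to synthesize, in [0, 1]
    quality_threshold: float  # assumed: minimum quality score to keep, in [0, 1]

    def __post_init__(self) -> None:
        # Reject values outside the ranges assumed above.
        if self.scale < 0:
            raise ValueError("scale must be non-negative")
        if not 0.0 <= self.synthesis_ratio <= 1.0:
            raise ValueError("synthesis_ratio must be in [0, 1]")
        if not 0.0 <= self.quality_threshold <= 1.0:
            raise ValueError("quality_threshold must be in [0, 1]")


def split_budget(request: DataRequest) -> tuple[int, int]:
    """Return (retrieval_count, synthesis_count), which sum to request.scale."""
    synthesis_count = round(request.scale * request.synthesis_ratio)
    retrieval_count = request.scale - synthesis_count
    return retrieval_count, synthesis_count


request = DataRequest(
    query="laparoscopic surgery clips",
    scale=1000,
    synthesis_ratio=0.25,
    quality_threshold=0.8,
)
retrieval_count, synthesis_count = split_budget(request)
```

Computing the retrieval count as the remainder keeps the two counts summing to the requested scale even when the ratio does not divide it evenly.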
Problem

Research questions and friction points this paper is trying to address.

video data construction
multimodal large language models
dataset evolution
domain-specific data
data infrastructure
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolving video data
configurable data construction
retrieval-synthesis pipeline
multi-dimensional metadata annotation
MCP-based data ingestion
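
The metadata dimensions mentioned in the abstract (motion scoring, OCR ratio, automatic captioning, provenance) suggest how a later "cooking" step could select clips. The following is a minimal sketch under stated assumptions: the record type `ClipMetadata`, its fields and value ranges, and the function `cook` are hypothetical, not the schema or filtering logic used by VDCook.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ClipMetadata:
    """Hypothetical per-clip record; the fields are assumptions for illustration."""

    clip_id: str
    source: str          # provenance tag, e.g. "retrieved" or "synthesized"
    motion_score: float  # assumed: higher means more motion
    ocr_ratio: float     # assumed: fraction of the frame area covered by text
    caption: str         # automatically generated caption


def cook(
    clips: list[ClipMetadata],
    min_motion: float,
    max_ocr_ratio: float,
) -> list[ClipMetadata]:
    """Keep clips with enough motion and not too much on-screen text."""
    return [
        clip
        for clip in clips
        if clip.motion_score >= min_motion and clip.ocr_ratio <= max_ocr_ratio
    ]


clips = [
    ClipMetadata("a", "retrieved", 0.9, 0.05, "a surgeon sutures tissue"),
    ClipMetadata("b", "retrieved", 0.2, 0.02, "a static title card"),
    ClipMetadata("c", "synthesized", 0.7, 0.60, "a slide full of text"),
]
selected = cook(clips, min_motion=0.5, max_ocr_ratio=0.1)
selected_ids = [clip.clip_id for clip in selected]
```

Because the filter reads only stored metadata, different thresholds can be applied to the same package without reprocessing the videos.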