🤖 AI Summary
Problem: Constructing customized video datasets from large-scale online video repositories is time-consuming, labor-intensive, and heavily reliant on manual annotation.
Method: This paper introduces VC-Agent, the first interactive agent for video collection, designed to automate and streamline dataset curation. VC-Agent uses multimodal large language models (MLLMs) to align user intent with video semantics, supporting natural-language queries and real-time interactive feedback. It proposes a novel dynamic dual-filtering mechanism that jointly optimizes relevance and diversity and adaptively refines its filtering criteria based on user confirmations. Together, these components form an end-to-end video collection pipeline that improves iteratively as the user interacts with it.
Contribution/Results: Extensive experiments across diverse real-world scenarios demonstrate significant improvements in data collection efficiency. User studies validate VC-Agent’s usability and superior performance, showing a reduction of over 60% in human annotation effort. VC-Agent establishes a new paradigm for constructing high-quality, task-specific video datasets with minimal human intervention.
📝 Abstract
In the era of scaling laws, video data from the internet is becoming increasingly important. However, collecting large numbers of videos that meet specific needs is extremely labor-intensive and time-consuming. In this work, we study how to expedite this collection process and propose VC-Agent, the first interactive agent that understands users' queries and feedback and accordingly retrieves and scales up relevant video clips with minimal user input. Specifically, for the user interface, our agent defines several user-friendly ways for the user to specify requirements through textual descriptions and confirmations. For the agent functions, we leverage existing multi-modal large language models to connect the user's requirements with the video content. More importantly, we propose two novel filtering policies that are updated as user interaction continues. Finally, we provide a new benchmark for personalized video dataset collection and conduct a user study to verify our agent's usefulness in various real scenarios. Extensive experiments demonstrate the effectiveness and efficiency of our agent for customized video dataset collection. Project page: https://allenyidan.github.io/vcagent_page/.