🤖 AI Summary
To address the challenge of stylizing videos with complex transitions from open-ended text descriptions, this paper introduces the first three-agent collaborative system based on multimodal large language models (MLLMs): a Video Parser that automatically segments a video into shots and generates a content prompt for each shot; a Style Parser that matches a vaguely described style to a style model via Tree-of-Thought search; and a Style Artist that adapts the level of detail control during rendering through multi-round self-reflection. Key contributions include: (1) a novel MLLM-based multi-agent collaboration and self-reflection paradigm; (2) video-to-shot prompt generation; (3) tree-structured style search; and (4) multi-round self-reflection rendering. The paper also constructs TVSBench, the first text-driven video stylization benchmark oriented toward open user queries. On TVSBench, the method outperforms FRESCO and ControlVideo by 6.05% and 4.51%, respectively, in overall average metrics, establishing new state-of-the-art performance and improving semantic alignment and visual fidelity for complex videos under open-ended style descriptions.
📝 Abstract
Despite recent advances in video stylization, most existing methods struggle to render videos with complex transitions based on an open style description in a user query. To fill this gap, we introduce V-Stylist, a generic multi-agent system for video stylization built on a novel collaboration and reflection paradigm of multi-modal large language models. Specifically, V-Stylist is a systematic workflow with three key roles: (1) Video Parser decomposes the input video into a number of shots and generates text prompts describing the key content of each shot. Via a concise video-to-shot prompting paradigm, it allows V-Stylist to effectively handle videos with complex transitions. (2) Style Parser identifies the style in the user query and progressively searches a style tree for the matching style model. Via a robust tree-of-thought searching paradigm, it allows V-Stylist to precisely specify a vague style preference in an open user query. (3) Style Artist leverages the matched model to render all the video shots in the required style. Via a novel multi-round self-reflection paradigm, it allows V-Stylist to adaptively adjust detail control according to the style requirement. With this distinct design, which mimics human professionals, V-Stylist achieves a major breakthrough on the primary challenges of effective and automatic video stylization. Moreover, we construct a new benchmark, the Text-driven Video Stylization Benchmark (TVSBench), which fills the gap in assessing the stylization of complex videos on open user queries. Extensive experiments show that V-Stylist achieves state-of-the-art performance, e.g., V-Stylist surpasses FRESCO and ControlVideo by 6.05% and 4.51%, respectively, in overall average metrics, marking a significant advance in video stylization.
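The three roles described in the abstract can be sketched as a small control-flow skeleton. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the `StyleNode` structure, the keyword-counting `relevance` function (a stand-in for an MLLM judgment), the single scalar `strength` used for detail control, the greedy single-path descent (a simplification of tree-of-thought search), and the toy tree and critic are all hypothetical. Shot prompts are taken as given, so the Video Parser's segmentation step is not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class StyleNode:
    """One node of a hypothetical style tree; leaves carry a style-model id."""
    name: str
    keywords: List[str]
    children: List["StyleNode"] = field(default_factory=list)
    model: Optional[str] = None


def relevance(node: StyleNode, query: str) -> int:
    """Toy stand-in for an MLLM judgment: count keyword hits in the query."""
    q = query.lower()
    return sum(1 for kw in node.keywords if kw in q)


def search_style_tree(root: StyleNode, query: str) -> StyleNode:
    """Coarse-to-fine descent: at each level keep the best-matching child."""
    node = root
    while node.children:
        node = max(node.children, key=lambda child: relevance(child, query))
    return node


# A critic returns "ok", "more_detail", or "less_detail" (assumed protocol).
Critic = Callable[[str, str, float], str]


def render_with_reflection(prompt: str, model: str, critic: Critic,
                           strength: float = 0.5, step: float = 0.1,
                           max_rounds: int = 5) -> Tuple[float, int]:
    """Adjust the detail-control strength until the critic accepts.

    Returns the final strength and the number of rounds used.
    """
    for round_idx in range(1, max_rounds + 1):
        verdict = critic(prompt, model, strength)
        if verdict == "ok":
            return strength, round_idx
        delta = step if verdict == "more_detail" else -step
        strength = round(min(1.0, max(0.0, strength + delta)), 2)
    return strength, max_rounds


def stylize(shot_prompts: List[str], query: str, root: StyleNode,
            critic: Critic) -> List[tuple]:
    """Pick one style model for the query, then refine each shot separately."""
    leaf = search_style_tree(root, query)
    return [(p, leaf.model, *render_with_reflection(p, leaf.model, critic))
            for p in shot_prompts]


def build_demo_tree() -> StyleNode:
    """A tiny made-up style tree for demonstration only."""
    painting = StyleNode("painting", ["painting", "paint", "watercolor", "oil"], [
        StyleNode("watercolor", ["watercolor"], model="watercolor-model"),
        StyleNode("oil", ["oil"], model="oil-model"),
    ])
    animation = StyleNode("animation", ["anime", "cartoon"], [
        StyleNode("anime", ["anime"], model="anime-model"),
    ])
    return StyleNode("root", [], [painting, animation])


def demo_critic(prompt: str, model: str, strength: float) -> str:
    """Toy critic that asks for more detail until strength reaches 0.7."""
    return "more_detail" if strength < 0.7 else "ok"
```

Calling `stylize(["shot 1", "shot 2"], "a soft watercolor painting look", build_demo_tree(), demo_critic)` selects the watercolor leaf once and then runs the reflection loop independently for each shot. In a real system the critic and the relevance score would presumably be MLLM calls, and the renderer would produce frames rather than a single number.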