Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation

📅 2024-08-19
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Current text-to-video models suffer from significant deficiencies in physical plausibility, photorealistic lighting, camera motion, and temporal coherence, limiting their applicability to cinematic-grade synthesis. To address this, the authors propose the first multi-agent VLM framework tailored for high-fidelity 3D video generation, featuring decoupled Director, Programmer, and Reviewer agents. The method decomposes the synthesis task, automatically generates Blender scripting code, and performs iterative optimization guided by vision-language feedback, enabling end-to-end, interpretable, and editable video generation. By integrating cinematographic knowledge with a closed-loop 3D rendering pipeline, it produces high-fidelity videos aligned with textual prompts without manual intervention. Experiments demonstrate superior performance over leading commercial models on five video-quality and instruction-following metrics. User studies further confirm substantial improvements: +28.6% in physical plausibility, +31.2% in temporal consistency, and higher overall quality scores.

📝 Abstract
Text-to-video generation has been dominated by diffusion-based or autoregressive models. These models offer impressive versatility but are criticized for implausible physical motion, shading and illumination, camera motion, and temporal consistency. The film industry instead relies on manually edited Computer-Generated Imagery (CGI) built with 3D modeling software. Human-directed 3D synthetic videos avoid these shortcomings but require tight collaboration between filmmakers and 3D rendering experts. We introduce an automatic synthetic video generation pipeline based on Vision Large Language Model (VLM) agent collaboration. Given a language description of a video, multiple VLM agents direct the stages of the generation pipeline, cooperating to create Blender scripts that render a video matching the description. Augmented with Blender-based filmmaking knowledge, the Director agent decomposes the text-based video description into sub-processes. For each sub-process, the Programmer agent produces Python-based Blender scripts through function composition and API calls. The Reviewer agent, equipped with knowledge of video reviewing, character motion coordinates, and intermediate screenshots, provides feedback to the Programmer agent, which iteratively improves its scripts to yield the best video outcome. Our generated videos achieve better quality than those of commercial video generation models on five metrics of video quality and instruction following, and our framework outperforms other approaches in a user study on quality, consistency, and rationality.
Problem

Research questions and friction points this paper is trying to address.

Address improper motion and consistency in text-to-video generation
Reduce manual CGI editing in film industry workflows
Automate synthetic video creation via VLM agent collaboration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal VLM agents collaborate for video generation
Director decomposes text into Blender sub-processes
Programmer iteratively improves scripts via Reviewer feedback
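The Director/Programmer/Reviewer loop above can be sketched as a simple orchestration pattern. This is a minimal, hypothetical illustration, not the paper's actual implementation: the three agent functions are stand-ins for VLM calls, and all names, signatures, and scoring values are invented for the example.

```python
# Hypothetical sketch of the Director -> Programmer -> Reviewer loop.
# Each agent function is a stub standing in for a VLM call; in the real
# pipeline the Programmer would emit Blender (bpy) Python scripts and the
# Reviewer would inspect rendered screenshots and motion coordinates.

def director(description):
    """Decompose a video description into sub-processes (stubbed)."""
    return [f"{description}: scene setup",
            f"{description}: character animation",
            f"{description}: camera motion"]

def programmer(subtask, feedback=None):
    """Produce a Blender Python script for one sub-process (stubbed)."""
    script = f"# Blender script for: {subtask}"
    if feedback:
        script += f"\n# revised per feedback: {feedback}"
    return script

def reviewer(script):
    """Return a quality score and textual feedback (stubbed heuristic)."""
    revisions = script.count("revised")
    return min(1.0, 0.5 + 0.3 * revisions), "tighten the camera path"

def generate(description, max_rounds=3, threshold=0.9):
    """Run the iterative refinement loop for every sub-process."""
    scripts = []
    for subtask in director(description):
        feedback = None
        for _ in range(max_rounds):
            script = programmer(subtask, feedback)
            score, feedback = reviewer(script)
            if score >= threshold:  # Reviewer is satisfied; stop refining
                break
        scripts.append(script)
    return scripts
```

The key design point the paper describes is the closed loop: the Programmer does not emit a script once, but revises it until the Reviewer's vision-language feedback (here reduced to a numeric score) converges or a round limit is hit.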