🤖 AI Summary
Existing video generation models struggle to align with the user's creative intent when given abstract, sparse, or complex conditions such as storyboard sketches or clay renders. This work proposes CogOmniControl, a reasoning-driven framework that decouples controllable video generation into two stages: cognition of creative intent, followed by video generation. Specifically, a specialized vision-language model, CogVLM, trained on authentic anime production data, first interprets the user's intent and expands the sparse cues into dense reasoning output; then a unified multi-condition video diffusion transformer, CogOmniDiT, generates the video and is aligned to that reasoning output via reinforcement learning. CogVLM is also used to plan evaluators that drive Best-of-N selection over the generated videos, which turns the framework into a closed-loop "harness-like" architecture. The study introduces CogReasonBench and CogControlBench, two evaluation benchmarks built from real professional workflow data, on which the proposed approach surpasses existing open-source models.
📝 Abstract
Recent diffusion models achieve strong photorealism and fluency in video generation, yet remain fragile under abstract, sparse, or complex conditions, leading to poor performance in professional production workflows that rely on conditions such as storyboard sketches and clay renders. Existing video generation models either inject conditions through adapters or couple a generic vision-language model (VLM) with a diffusion backbone, leaving a capability gap and failing to produce videos that align with the user's creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative intent cognition and generation. Specifically, we train a specialized CogVLM on authentic anime production data. Compared with generic VLMs, it generates more professional and clearer outputs, accurately cognizing the user's creative intent from sparse and abstract conditions and turning these cues into dense reasoning output. In addition, CogOmniDiT unifies the controls from various conditions through in-context generation and is aligned to the CogVLM reasoning outputs via reinforcement learning. Furthermore, leveraging CogVLM's robust capability in guiding video generation, we unlock its potential for planning specific evaluators and enable Best-of-N selection over the generated videos. This integration transforms the entire framework into a closed-loop "harness-like" architecture. We further introduce CogReasonBench and CogControlBench, built from professional workflow data that carries genuine creative intent rather than simulated intent. Experiments on the two benchmarks show that CogOmniControl surpasses existing open-source models. The project website: https://um-lab.github.io/CogOmniControl/
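The Best-of-N selection step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how evaluators are planned or how their scores are combined, so the `Generator` and `Evaluator` callables, the `Candidate` type, the mean-score aggregation, and the toy functions below are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# a generator maps (reasoning text, seed) to a video handle, and
# an evaluator maps (reasoning text, video handle) to a numeric score.
Generator = Callable[[str, int], str]
Evaluator = Callable[[str, str], float]


@dataclass(frozen=True)
class Candidate:
    seed: int
    video: str  # placeholder for a generated clip, e.g. a file path


def best_of_n(
    reasoning: str,
    generate: Generator,
    evaluators: Sequence[Evaluator],
    n: int = 4,
) -> Tuple[Candidate, float]:
    """Generate n candidates and return the one with the highest mean score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not evaluators:
        raise ValueError("at least one evaluator is required")

    best: Optional[Candidate] = None
    best_score = float("-inf")
    for seed in range(n):
        video = generate(reasoning, seed)
        # Assumption: evaluator scores are combined by a plain average.
        score = sum(ev(reasoning, video) for ev in evaluators) / len(evaluators)
        if score > best_score:  # strict '>' keeps the earliest candidate on ties
            best, best_score = Candidate(seed, video), score
    assert best is not None
    return best, best_score


# Toy stand-ins so the sketch runs without any model.
def toy_generate(reasoning: str, seed: int) -> str:
    return f"{reasoning}#seed{seed}"


def toy_evaluator(reasoning: str, video: str) -> float:
    # Pretend that only the candidate from seed 2 matches the intent.
    return 1.0 if video.endswith("seed2") else 0.0


if __name__ == "__main__":
    winner, score = best_of_n("plan", toy_generate, [toy_evaluator], n=4)
    print(winner, score)
```

In a real pipeline, `generate` would presumably wrap the video generator conditioned on the reasoning output, and the evaluators would be the ones planned by the vision-language model; both are left abstract here because the abstract gives no interface details.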