Training-free Motion Factorization for Compositional Video Generation

📅 2026-03-09
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses a key limitation of existing methods: they struggle to accurately interpret the diverse motion categories specified in textual prompts, which limits the quality of multi-instance video synthesis. The authors propose a training-free motion factorization framework that disentangles complex motion into three canonical types: static, rigid, and non-rigid. Adopting a “plan-then-generate” paradigm, the approach first infers instance-level shape and positional dynamics over a motion graph during the planning phase, then modulates each motion type in a disentangled manner during generation. The method achieves, for the first time, training-free disentanglement of motion categories, introduces motion graph–guided structured semantic representations, and incorporates model-agnostic modules compatible with various diffusion architectures. Experiments demonstrate significant improvements in motion-synthesis quality on real-world benchmarks, enabling compositional video generation with multiple instances, appearances, and motion types.
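To make the plan-then-generate idea concrete, here is a minimal sketch in Python. Everything in it (the class and function names, the hard-coded per-category motion rules) is a hypothetical illustration, not the paper's implementation; the actual planner reasons over a motion graph inferred from the prompt rather than applying fixed rules.

```python
# Hypothetical sketch of a plan-then-generate motion planner.
# None of these names come from the paper.
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple

class MotionType(Enum):
    STATIC = "static"        # motionless: appearance stays fixed
    RIGID = "rigid"          # rigid motion: position changes, shape preserved
    NON_RIGID = "non_rigid"  # non-rigid: local shape deformation

@dataclass
class InstancePlan:
    name: str
    motion: MotionType
    # Frame-wise (x, y, w, h) boxes produced during the planning phase.
    boxes: List[Tuple[float, float, float, float]] = field(default_factory=list)

def plan_motion(prompt_instances: Dict[str, MotionType],
                init_boxes: Dict[str, Tuple[float, float, float, float]],
                num_frames: int) -> List[InstancePlan]:
    """Toy planner: derive frame-wise shape/position dynamics per instance.

    Static instances keep their box; rigid ones translate with constant
    velocity while preserving width/height; non-rigid ones stay in place
    but let the box deform over time.
    """
    plans = []
    for name, motion in prompt_instances.items():
        x, y, w, h = init_boxes[name]
        boxes = []
        for t in range(num_frames):
            if motion is MotionType.STATIC:
                boxes.append((x, y, w, h))
            elif motion is MotionType.RIGID:
                boxes.append((x + 0.01 * t, y, w, h))  # translate, shape fixed
            else:  # NON_RIGID: deform the box in place
                s = 1.0 + 0.05 * (t % 4) / 4.0
                boxes.append((x, y, w * s, h / s))
        plans.append(InstancePlan(name, motion, boxes))
    return plans
```

For example, `plan_motion({"ball": MotionType.RIGID}, {"ball": (0.1, 0.5, 0.2, 0.2)}, 16)` yields 16 boxes that translate rightward while keeping the ball's shape fixed; the generation stage would then consume such per-instance trajectories as motion cues.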

📝 Abstract
Compositional video generation aims to synthesize multiple instances with diverse appearances and motions, a capability widely applicable in real-world scenarios. However, current approaches focus mainly on binding semantics and neglect the diverse motion categories specified in prompts. In this paper, we propose a motion factorization framework that decomposes complex motion into three primary categories: motionlessness, rigid motion, and non-rigid motion. Our framework follows a planning-before-generation paradigm. (1) During planning, we reason about motion laws on a motion graph to obtain frame-wise changes in the shape and position of each instance. This alleviates semantic ambiguity in the user prompt by organizing it into a structured representation of instances and their interactions. (2) During generation, we modulate the synthesis of the distinct motion categories in a disentangled manner: conditioned on the planned motion cues, guidance branches stabilize appearance in motionless regions, preserve rigid-body geometry, and regularize local non-rigid deformations. Crucially, both modules are model-agnostic and can be seamlessly incorporated into various diffusion architectures. Extensive experiments demonstrate that our framework achieves strong motion-synthesis performance on real-world benchmarks. Our code will be released soon.
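As a rough illustration of the disentangled guidance branches described in the abstract, the sketch below defines one toy energy per motion category, in the spirit of guidance-based sampling. The function name, the mask format, and the specific penalties are assumptions made for illustration; they are not the paper's actual guidance losses.

```python
import torch

def motion_guidance(x0_pred: torch.Tensor,
                    x0_prev: torch.Tensor,
                    masks: dict,
                    motion_types: dict) -> torch.Tensor:
    """Toy per-region guidance energies, one branch per motion category.

    x0_pred / x0_prev: predicted clean frames at the current and previous
    denoising step, shape (T, C, H, W). masks[name]: (T, 1, H, W) instance
    masks derived from the planning stage. Returns a scalar energy whose
    gradient could be subtracted from the score during sampling
    (an illustrative stand-in, not the paper's formulation).
    """
    energy = x0_pred.new_zeros(())
    for name, mask in masks.items():
        region = x0_pred * mask
        if motion_types[name] == "static":
            # Stabilize appearance: penalize any frame-to-frame change.
            energy = energy + (region[1:] - region[:-1]).pow(2).mean()
        elif motion_types[name] == "rigid":
            # Preserve geometry: keep the region close to the previous
            # step's prediction (a crude proxy for shape consistency).
            energy = energy + (region - x0_prev * mask).pow(2).mean()
        else:  # non-rigid
            # Regularize local deformation: penalize large spatial
            # gradients so deformations stay smooth.
            dx = region[..., :, 1:] - region[..., :, :-1]
            dy = region[..., 1:, :] - region[..., :-1, :]
            energy = energy + dx.pow(2).mean() + dy.pow(2).mean()
    return energy
```

In a guidance-based sampler, the gradient of such an energy with respect to x0_pred would be scaled and folded into the model's prediction at each denoising step; because the energy only touches masked regions, the three branches act on their respective motion categories independently.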
Problem

Research questions and friction points this paper is trying to address.

compositional video generation
motion factorization
motion categories
semantic ambiguity
video synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

motion factorization
compositional video generation
model-agnostic framework
motion disentanglement
planning-before-generation
Zixuan Wang
Sichuan University
Ziqin Zhou
The University of Adelaide
Feng Chen
The University of Adelaide
Duo Peng
Nanyang Technological University
Yixin Hu
Sichuan University
Changsheng Li
Beijing Institute of Technology
Yinjie Lei
Sichuan University