Training-free Motion Factorization for Compositional Video Generation

📅 2026-03-09
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses a key limitation of existing methods: they struggle to accurately interpret the diverse motion categories specified in textual prompts, which limits the quality of multi-instance video synthesis. The authors propose a training-free motion factorization framework that disentangles complex motion into three canonical types: static, rigid, and non-rigid. Adopting a “plan-then-generate” paradigm, the approach first infers instance-level shape and positional dynamics over a motion graph during the planning phase, then modulates each motion type in a disentangled manner during generation. The method achieves, for the first time, training-free disentanglement of motion categories, introduces motion graph–guided structured semantic representations, and incorporates model-agnostic modules compatible with various diffusion architectures. Experiments demonstrate significant improvements in motion-synthesis quality on real-world benchmarks, enabling compositional video generation with multiple instances, appearances, and motion types.
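To make the plan-then-generate idea concrete, here is a minimal sketch in Python. Everything in it (the class and function names, the hard-coded per-category motion rules) is a hypothetical illustration, not the paper's implementation; the actual planner reasons over a motion graph inferred from the prompt rather than applying fixed rules.

```python
# Hypothetical sketch of a plan-then-generate motion planner.
# None of these names come from the paper.
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple

class MotionType(Enum):
    STATIC = "static"        # motionless: appearance stays fixed
    RIGID = "rigid"          # rigid motion: position changes, shape preserved
    NON_RIGID = "non_rigid"  # non-rigid: local shape deformation

@dataclass
class InstancePlan:
    name: str
    motion: MotionType
    # Frame-wise (x, y, w, h) boxes produced during the planning phase.
    boxes: List[Tuple[float, float, float, float]] = field(default_factory=list)

def plan_motion(prompt_instances: Dict[str, MotionType],
                init_boxes: Dict[str, Tuple[float, float, float, float]],
                num_frames: int) -> List[InstancePlan]:
    """Toy planner: derive frame-wise shape/position dynamics per instance.

    Static instances keep their box; rigid ones translate with constant
    velocity while preserving width/height; non-rigid ones stay in place
    but let the box deform over time.
    """
    plans = []
    for name, motion in prompt_instances.items():
        x, y, w, h = init_boxes[name]
        boxes = []
        for t in range(num_frames):
            if motion is MotionType.STATIC:
                boxes.append((x, y, w, h))
            elif motion is MotionType.RIGID:
                boxes.append((x + 0.01 * t, y, w, h))  # translate, shape fixed
            else:  # NON_RIGID: deform the box in place
                s = 1.0 + 0.05 * (t % 4) / 4.0
                boxes.append((x, y, w * s, h / s))
        plans.append(InstancePlan(name, motion, boxes))
    return plans
```

For example, `plan_motion({"ball": MotionType.RIGID}, {"ball": (0.1, 0.5, 0.2, 0.2)}, 16)` yields 16 boxes that translate rightward while keeping the ball's shape fixed; the generation stage would then consume such per-instance trajectories as motion cues.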

📝 Abstract
Compositional video generation aims to synthesize multiple instances with diverse appearances and motions, a capability widely applicable in real-world scenarios. However, current approaches focus mainly on binding semantics and neglect the diverse motion categories specified in prompts. In this paper, we propose a motion factorization framework that decomposes complex motion into three primary categories: motionlessness, rigid motion, and non-rigid motion. Our framework follows a planning-before-generation paradigm. (1) During planning, we reason about motion laws on a motion graph to obtain frame-wise changes in the shape and position of each instance. This alleviates semantic ambiguity in the user prompt by organizing it into a structured representation of instances and their interactions. (2) During generation, we modulate the synthesis of the distinct motion categories in a disentangled manner: conditioned on the planned motion cues, guidance branches stabilize appearance in motionless regions, preserve rigid-body geometry, and regularize local non-rigid deformations. Crucially, both modules are model-agnostic and can be seamlessly incorporated into various diffusion architectures. Extensive experiments demonstrate that our framework achieves strong motion-synthesis performance on real-world benchmarks. Our code will be released soon.
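As a rough illustration of the disentangled guidance branches described in the abstract, the sketch below defines one toy energy per motion category, in the spirit of guidance-based sampling. The function name, the mask format, and the specific penalties are assumptions made for illustration; they are not the paper's actual guidance losses.

```python
import torch

def motion_guidance(x0_pred: torch.Tensor,
                    x0_prev: torch.Tensor,
                    masks: dict,
                    motion_types: dict) -> torch.Tensor:
    """Toy per-region guidance energies, one branch per motion category.

    x0_pred / x0_prev: predicted clean frames at the current and previous
    denoising step, shape (T, C, H, W). masks[name]: (T, 1, H, W) instance
    masks derived from the planning stage. Returns a scalar energy whose
    gradient could be subtracted from the score during sampling
    (an illustrative stand-in, not the paper's formulation).
    """
    energy = x0_pred.new_zeros(())
    for name, mask in masks.items():
        region = x0_pred * mask
        if motion_types[name] == "static":
            # Stabilize appearance: penalize any frame-to-frame change.
            energy = energy + (region[1:] - region[:-1]).pow(2).mean()
        elif motion_types[name] == "rigid":
            # Preserve geometry: keep the region close to the previous
            # step's prediction (a crude proxy for shape consistency).
            energy = energy + (region - x0_prev * mask).pow(2).mean()
        else:  # non-rigid
            # Regularize local deformation: penalize large spatial
            # gradients so deformations stay smooth.
            dx = region[..., :, 1:] - region[..., :, :-1]
            dy = region[..., 1:, :] - region[..., :-1, :]
            energy = energy + dx.pow(2).mean() + dy.pow(2).mean()
    return energy
```

In a guidance-based sampler, the gradient of such an energy with respect to x0_pred would be scaled and folded into the model's prediction at each denoising step; because the energy only touches masked regions, the three branches act on their respective motion categories independently.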
Problem

Research questions and friction points this paper is trying to address.

compositional video generation
motion factorization
motion categories
semantic ambiguity
video synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

motion factorization
compositional video generation
model-agnostic framework
motion disentanglement
planning-before-generation
Zixuan Wang
Sichuan University
Ziqin Zhou
The University of Adelaide
Feng Chen
The University of Adelaide
Duo Peng
Nanyang Technological University
Yixin Hu
Sichuan University
Changsheng Li
Beijing Institute of Technology
Yinjie Lei
Sichuan University