MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation

📅 2025-03-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing text-to-video (T2V) methods suffer from inaccurate attribute binding, ambiguous spatial relationships, and weak modeling of action-based interactions in multi-subject scenarios. To address these limitations, we propose a training-free, two-stage decoupled refinement framework: (1) semantic anchor disambiguation enhances the semantic specificity of conditional inputs; (2) dynamic layout-fused attention enables precise spatiotemporal binding of subjects to video regions during denoising. Our approach establishes the first model-agnostic, plug-and-play paradigm for enhancing compositional video generation, integrating mask-modulated attention, grounding-aware prior integration, and diffusion-condition optimization. Evaluated on T2V-CompBench and VBench, it significantly outperforms state-of-the-art methods, markedly improving complex prompt parsing, trajectory controllability, and structural consistency and fidelity in multi-subject interactive videos.
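To make the first phase concrete, here is a minimal PyTorch sketch of the idea of progressively injecting anchor directions into subject token embeddings. The function name, the linear schedule, the scale `alpha`, and the way anchors are indexed are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def semantic_anchor_disambiguation(text_emb, anchor_embs, step, total_steps, alpha=0.1):
    """Conceptual sketch of the conditioning-stage refinement.

    For each subject, a standalone "anchor" embedding (e.g., the subject
    phrase encoded in isolation) supplies a direction that the matching
    token embedding is progressively pushed toward, reducing semantic
    bleed between subjects. Schedule and scale are assumptions.
    """
    refined = text_emb.clone()                      # [num_tokens, dim]
    strength = alpha * (step + 1) / total_steps     # progressive injection
    for token_idx, anchor in anchor_embs.items():
        direction = F.normalize(anchor - refined[token_idx], dim=-1)
        refined[token_idx] = refined[token_idx] + strength * direction
    return refined
```

In practice such a refinement would be applied to the prompt embeddings fed into the diffusion model's cross-attention, so later layers condition on less ambiguous subject representations.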

📝 Abstract
Text-to-video (T2V) generation has made significant strides with diffusion models. However, existing methods still struggle with accurately binding attributes, determining spatial relationships, and capturing complex action interactions between multiple subjects. To address these limitations, we propose MagicComp, a training-free method that enhances compositional T2V generation through dual-phase refinement. Specifically, (1) during the conditioning stage, we introduce Semantic Anchor Disambiguation, which reinforces subject-specific semantics and resolves inter-subject ambiguity by progressively injecting the directional vectors of semantic anchors into the original text embeddings; (2) during the denoising stage, we propose Dynamic Layout Fusion Attention, which integrates grounding priors and model-adaptive spatial perception to flexibly bind subjects to their spatiotemporal regions through masked attention modulation. Furthermore, MagicComp is a model-agnostic and versatile approach that can be seamlessly integrated into existing T2V architectures. Extensive experiments on T2V-CompBench and VBench demonstrate that MagicComp outperforms state-of-the-art methods, highlighting its potential for applications such as complex prompt-based and trajectory-controllable video generation. Project page: https://hong-yu-zhang.github.io/MagicComp-Page/.
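For the second phase, the sketch below illustrates masked cross-attention that suppresses attention between video tokens and subject tokens outside the subject's grounded region. How the region mask is actually built (fusing grounding priors with model-adaptive spatial perception) is abstracted away; the function name and the bias value are assumptions for illustration only.

```python
import torch

def dynamic_layout_fusion_attention(q, k, v, region_mask, neg_bias=-1e4):
    """Conceptual sketch of the denoising-stage masked cross-attention.

    q: video (latent) tokens; k, v: text tokens.
    region_mask is 1 where a video token falls inside the grounded
    spatio-temporal region of the text token it attends to, 0 otherwise.
    Attention outside a subject's region is damped with a large negative bias.
    """
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale          # [..., Nq, Nk]
    scores = scores + (1.0 - region_mask) * neg_bias    # masked attention modulation
    attn = torch.softmax(scores, dim=-1)
    return attn @ v
```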
Problem

Research questions and friction points this paper is trying to address.

Inaccurate attribute binding in text-to-video generation
Ambiguous spatial relationships in synthesized videos
Weak modeling of complex action interactions between multiple subjects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Anchor Disambiguation for text embedding
Dynamic Layout Fusion Attention for spatial binding
Training-free dual-phase refinement for T2V generation
👥 Authors

Hongyu Zhang
Chongqing University
Software Engineering, Mining Software Repositories, Data-driven Software Engineering, Software Analytics

Yufan Deng
Oxford VGG

Shenghai Yuan
School of Electronic and Computer Engineering, Peking University, Shenzhen, China

Peng Jin
School of Electronic and Computer Engineering, Peking University, Shenzhen, China

Zesen Cheng
Peking University
MLLM, Video LLM, Visual Grounding, Image/Video Segmentation

Yian Zhao
Peking University
3D Gaussian Splatting, MLLM

Chang Liu
Tsinghua University, Beijing, China

Jie Chen
Peng Cheng Laboratory, Shenzhen, China