DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation

📅 2025-04-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing text-to-video diffusion models suffer from layout discontinuities, identity drift, and interaction distortions in multi-entity dynamic scenes—stemming from the lack of spatiotemporal constraints and physics-awareness in cross-attention mechanisms. To address this, we propose the first training-free, frame-aware control framework that enables precise spatiotemporal content regulation over off-the-shelf video diffusion models (e.g., CogVideoX-5B). Our method integrates an LLM-driven dynamic layout planner—incorporating trajectory optimization and entity graph parsing—with a dual-prompt attention masking mechanism and an entity-consistent feature propagation strategy. Crucially, it requires no model fine-tuning. Experimental results demonstrate substantial improvements in entity identity stability, spatial relationship consistency, and physically plausible interactions under complex compositional prompts. This work bridges a critical gap in training-free video generation by simultaneously enhancing controllability and physical plausibility.

Technology Category

Application Category

📝 Abstract
Compositional text-to-video generation, which requires synthesizing dynamic scenes with multiple interacting entities and precise spatial-temporal relationships, remains a critical challenge for diffusion-based models. Existing methods struggle with layout discontinuity, entity identity drift, and implausible interaction dynamics due to unconstrained cross-attention mechanisms and inadequate physics-aware reasoning. To address these limitations, we propose DyST-XL, a extbf{training-free} framework that enhances off-the-shelf text-to-video models (e.g., CogVideoX-5B) through frame-aware control. DyST-XL integrates three key innovations: (1) A Dynamic Layout Planner that leverages large language models (LLMs) to parse input prompts into entity-attribute graphs and generates physics-aware keyframe layouts, with intermediate frames interpolated via trajectory optimization; (2) A Dual-Prompt Controlled Attention Mechanism that enforces localized text-video alignment through frame-aware attention masking, achieving the precise control over individual entities; and (3) An Entity-Consistency Constraint strategy that propagates first-frame feature embeddings to subsequent frames during denoising, preserving object identity without manual annotation. Experiments demonstrate that DyST-XL excels in compositional text-to-video generation, significantly improving performance on complex prompts and bridging a crucial gap in training-free video synthesis.
Problem

Research questions and friction points this paper is trying to address.

Enhance text-to-video generation for complex dynamic scenes
Address layout discontinuity and entity identity drift issues
Improve physics-aware reasoning in video synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Layout Planner with LLM parsing
Dual-Prompt Controlled Attention Mechanism
Entity-Consistency Constraint strategy
🔎 Similar Papers
No similar papers found.
W
Weijie He
Zhejiang University
Mushui Liu
Mushui Liu
Zhejiang University
Generative ModelsMulti-modal LearningFew-shot Learning
Y
Yunlong Yu
Zhejiang University
Z
Zhao Wang
Zhejiang University
C
Chao Wu
Zhejiang University