DCDM: Divide-and-Conquer Diffusion Models for Consistency-Preserving Video Generation

📅 2026-02-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of insufficient semantic, geometric, and identity consistency in video generation by systematically decomposing video coherence into three distinct subproblems: intra-clip semantic consistency, inter-clip camera motion consistency, and inter-shot element consistency. To tackle these, the authors propose a divide-and-conquer diffusion framework built on a unified video generation backbone, integrating structured prompt parsing with large language models, noise-space temporal camera representations, text-to-image initialization, and a sparse inter-shot attention mechanism. Experimental results on the AAAI'26 CVM benchmark demonstrate that the proposed method significantly improves semantic coherence, camera motion consistency, and inter-shot element stability in generated videos.
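The summary mentions that an LLM parses input prompts into structured semantic representations before generation. The paper does not publish the schema, so the sketch below is a hypothetical illustration of what such a representation might look like; all field names (`scene`, `subjects`, `actions`, `camera_motion`) and the stub parser are assumptions, with the real system presumably filling these fields via an LLM call.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StructuredPrompt:
    """Hypothetical schema for an LLM-parsed structured semantic
    representation; field names are assumptions, not the paper's."""
    scene: str
    subjects: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    camera_motion: str = "static"


def parse_prompt_stub(prompt: str) -> StructuredPrompt:
    # Placeholder for the LLM parsing step: a real system would ask an
    # LLM to extract subjects, actions, and camera motion from the prompt.
    return StructuredPrompt(scene=prompt)
```

A downstream diffusion transformer could then condition on these fields separately rather than on the raw prompt string.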

📝 Abstract
Recent video generative models have demonstrated impressive visual fidelity, yet they often struggle with semantic, geometric, and identity consistency. In this paper, we propose a system-level framework, termed the Divide-and-Conquer Diffusion Model (DCDM), to address three key challenges: (1) intra-clip world knowledge consistency, (2) inter-clip camera consistency, and (3) inter-shot element consistency. DCDM decomposes video consistency modeling under these scenarios into three dedicated components while sharing a unified video generation backbone. For intra-clip consistency, DCDM leverages a large language model to parse input prompts into structured semantic representations, which are subsequently translated into coherent video content by a diffusion transformer. For inter-clip camera consistency, we propose a temporal camera representation in the noise space that enables precise and stable camera motion control, along with a text-to-image initialization mechanism to further enhance controllability. For inter-shot consistency, DCDM adopts a holistic scene generation paradigm with windowed cross-attention and sparse inter-shot self-attention, ensuring long-range narrative coherence while maintaining computational efficiency. We validate our framework on the test set of the CVM Competition at AAAI'26, and the results demonstrate that the proposed strategies effectively address these challenges.
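The abstract's inter-shot component combines dense attention within a shot with sparse attention across shots to keep cost manageable. A minimal sketch of one plausible sparse inter-shot mask is shown below; the paper does not specify the sparsity pattern, so the choice of letting every token additionally attend to the first few "anchor" tokens of each shot is purely an assumption for illustration.

```python
import numpy as np


def sparse_inter_shot_mask(num_shots: int, tokens_per_shot: int,
                           anchors_per_shot: int = 2) -> np.ndarray:
    """Boolean attention mask: dense attention inside each shot, plus
    sparse attention to a few anchor tokens of every shot.

    The anchor-token scheme is a hypothetical stand-in for the paper's
    (unspecified) sparse inter-shot self-attention pattern.
    """
    n = num_shots * tokens_per_shot
    mask = np.zeros((n, n), dtype=bool)
    # Dense intra-shot attention: each shot's tokens attend to each other.
    for s in range(num_shots):
        lo, hi = s * tokens_per_shot, (s + 1) * tokens_per_shot
        mask[lo:hi, lo:hi] = True
    # Sparse inter-shot attention: all tokens see each shot's anchors.
    for s in range(num_shots):
        lo = s * tokens_per_shot
        mask[:, lo:lo + anchors_per_shot] = True
    return mask
```

Such a mask grows roughly linearly in the number of shots for the cross-shot portion, instead of quadratically as full self-attention over all shots would.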
Problem

Research questions and friction points this paper is trying to address.

video generation
consistency
semantic consistency
camera consistency
inter-shot consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Divide-and-Conquer Diffusion
Video Consistency
Temporal Camera Control
Structured Semantic Representation
Sparse Inter-shot Attention