OutDreamer: Video Outpainting with a Diffusion Transformer

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video outpainting aims to generate coherent content beyond a video's spatial boundaries while preserving temporal consistency, yet existing U-Net-based latent diffusion models remain limited in generation quality and adaptability. This paper introduces OutDreamer, a diffusion Transformer (DiT)-based video outpainting framework. The method features: (1) a dual-branch design in which an efficient video control branch extracts information from the masked input and a conditional outpainting branch generates the missing content from those extracted conditions; (2) a mask-driven self-attention layer that dynamically incorporates the mask, steering attention toward valid contextual regions; and (3) a latent alignment loss, coupled with a cross-video-clip refiner for long videos, that enforces consistency within and between frames and across clips. On widely recognized benchmarks, the zero-shot approach outperforms state-of-the-art zero-shot methods in visual fidelity and motion coherence.

📝 Abstract
Video outpainting is a challenging task that generates new video content by extending beyond the boundaries of an original input video, requiring both temporal and spatial consistency. Many state-of-the-art methods utilize latent diffusion models with U-Net backbones but still struggle to achieve high quality and adaptability in generated content. Diffusion transformers (DiTs) have emerged as a promising alternative because of their superior performance. We introduce OutDreamer, a DiT-based video outpainting framework comprising two main components: an efficient video control branch and a conditional outpainting branch. The efficient video control branch effectively extracts masked video information, while the conditional outpainting branch generates missing content based on these extracted conditions. Additionally, we propose a mask-driven self-attention layer that dynamically integrates the given mask information, further enhancing the model's adaptability to outpainting tasks. Furthermore, we introduce a latent alignment loss to maintain overall consistency both within and between frames. For long video outpainting, we employ a cross-video-clip refiner to iteratively generate missing content, ensuring temporal consistency across video clips. Extensive evaluations demonstrate that our zero-shot OutDreamer outperforms state-of-the-art zero-shot methods on widely recognized benchmarks.
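The paper's exact formulation of the latent alignment loss is not given here, but its stated goal — consistency both within and between frames — can be illustrated with a hypothetical two-term loss: an intra-frame term matching each generated latent to a reference, plus an inter-frame term matching frame-to-frame latent differences. A minimal sketch, assuming mean-squared penalties (the function name and weighting are illustrative, not the paper's):

```python
import numpy as np

def latent_alignment_loss(latents, ref_latents):
    """Illustrative latent alignment loss (hypothetical formulation).

    latents, ref_latents: arrays of shape (T, C, H, W) holding per-frame
    latent representations. The intra-frame term matches each frame's
    latent to its reference; the inter-frame term matches the temporal
    differences between consecutive frames, encouraging consistency
    both within and between frames.
    """
    intra = np.mean((latents - ref_latents) ** 2)
    diff_gen = np.diff(latents, axis=0)        # frame-to-frame change, generated
    diff_ref = np.diff(ref_latents, axis=0)    # frame-to-frame change, reference
    inter = np.mean((diff_gen - diff_ref) ** 2)
    return intra + inter
```

A constant offset applied to every frame incurs only the intra-frame penalty, while temporally inconsistent jitter is additionally penalized by the inter-frame term.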
Problem

Research questions and friction points this paper is trying to address.

Extends video boundaries with temporal and spatial consistency
Improves quality and adaptability in generated video content
Ensures consistency in long videos through iterative refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

DiT-based video outpainting framework
Mask-driven self-attention layer
Cross-video-clip refiner for consistency
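The mechanics of the mask-driven self-attention layer are not detailed on this page; one plausible reading is that the mask is injected as an additive bias on the attention logits, so every query attends preferentially to tokens from the visible (non-outpainted) region. A minimal single-head sketch under that assumption (the `bias` parameter and function name are illustrative, not the paper's):

```python
import numpy as np

def mask_driven_attention(q, k, v, known_mask, bias=4.0):
    """Sketch of mask-driven self-attention (assumed mechanism).

    q, k, v: (N, d) token features; known_mask: (N,) bool, True where a
    token comes from the visible region of the masked video. An additive
    bias on the logits steers attention toward known-context keys, so
    tokens being generated draw content from valid regions.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (N, N) attention logits
    scores = scores + bias * known_mask[None, :]   # favor visible tokens
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)             # softmax over keys
    return w @ v
```

As `bias` grows, the layer degenerates into attention over the visible tokens only; at `bias=0` it is ordinary self-attention, so the mask term interpolates between the two.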