🤖 AI Summary
Video outpainting aims to generate coherent content beyond the spatiotemporal boundaries of an input video, yet existing U-Net-based latent diffusion models fall short in generation quality and architectural adaptability. This paper introduces OutDreamer, a diffusion Transformer (DiT)-based video outpainting framework. The method features: (1) a dual-branch design in which an efficient video control branch extracts information from the masked input and a conditional outpainting branch generates the missing content from those extracted conditions; (2) a mask-driven self-attention layer that dynamically integrates the given mask information, steering attention toward valid contextual regions; and (3) a latent alignment loss, paired with a cross-video-clip refiner for long videos, that enforces consistency within frames, between frames, and across clips. In zero-shot evaluation on widely recognized benchmarks, OutDreamer outperforms state-of-the-art zero-shot methods in visual fidelity and motion coherence.
📝 Abstract
Video outpainting is a challenging task that generates new video content beyond the boundaries of an original input video, requiring both spatial and temporal consistency. Many state-of-the-art methods rely on latent diffusion models with U-Net backbones but still struggle to produce high-quality, adaptable content. Diffusion transformers (DiTs) have emerged as a promising alternative because of their superior generative performance. We introduce OutDreamer, a DiT-based video outpainting framework comprising two main components: an efficient video control branch and a conditional outpainting branch. The efficient video control branch extracts information from the masked video, while the conditional outpainting branch generates the missing content based on these extracted conditions. Additionally, we propose a mask-driven self-attention layer that dynamically integrates the given mask information, further enhancing the model's adaptability to outpainting tasks. Furthermore, we introduce a latent alignment loss to maintain overall consistency both within and between frames. For long video outpainting, we employ a cross-video-clip refiner to iteratively generate missing content, ensuring temporal consistency across video clips. Extensive evaluations demonstrate that our zero-shot OutDreamer outperforms state-of-the-art zero-shot methods on widely recognized benchmarks.