🤖 AI Summary
This work addresses the challenge of video outpainting, which requires balancing spatial plausibility in individual frames with long-term temporal consistency—particularly when camera or object motion causes outpainted regions to become visible across multiple frames. To this end, the authors propose a hierarchical diffusion framework that first outpaints keyframes and then generates intermediate frames via conditional interpolation, thereby mitigating error accumulation. The method incorporates an enhanced spatiotemporal module and a global feature guidance mechanism, leveraging 3D window-based attention to strengthen spatiotemporal interactions. Additionally, a dedicated extractor compresses full-frame OpenCLIP features into compact global tokens. Built upon a pretrained image inpainting backbone, the approach outperforms existing methods on standard benchmarks in both reconstruction fidelity and temporal coherence.
📝 Abstract
Video outpainting extends a video beyond its original boundaries by synthesizing missing border content. Compared with image outpainting, it requires not only per-frame spatial plausibility but also long-range temporal coherence, especially when outpainted content becomes visible across time under camera or object motion. We propose GlobalPaint, a diffusion-based framework for spatiotemporally coherent video outpainting. Our approach adopts a hierarchical pipeline that first outpaints key frames and then completes intermediate frames via an interpolation model conditioned on the completed boundaries, reducing error accumulation in sequential processing. At the model level, we augment a pretrained image inpainting backbone with (i) an Enhanced Spatial-Temporal module featuring 3D windowed attention for stronger spatiotemporal interaction, and (ii) global feature guidance that distills OpenCLIP features from observed regions across all frames into compact global tokens using a dedicated extractor. Comprehensive evaluations on benchmark datasets demonstrate improved reconstruction quality and more natural motion compared to prior methods. Our demo page is available at https://yuemingpan.github.io/GlobalPaint/
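The hierarchical keyframe-then-interpolate pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `outpaint_keyframe` and `interpolate` are hypothetical placeholders (edge-padding and linear blending) standing in for the diffusion-based outpainting and conditional interpolation models.

```python
import numpy as np

def outpaint_keyframe(frame, pad):
    # Placeholder for the keyframe outpainting model:
    # extend the frame border by edge replication.
    return np.pad(frame, ((pad, pad), (pad, pad), (0, 0)), mode="edge")

def interpolate(prev_kf, next_kf, t):
    # Placeholder for the conditional interpolation model:
    # blend the two completed keyframes linearly.
    return (1.0 - t) * prev_kf + t * next_kf

def hierarchical_outpaint(video, pad, stride=4):
    """Outpaint keyframes every `stride` frames, then complete the
    frames in between conditioned on the completed keyframes."""
    n = len(video)
    key_idx = list(range(0, n, stride))
    if key_idx[-1] != n - 1:
        key_idx.append(n - 1)
    keyframes = {i: outpaint_keyframe(video[i], pad) for i in key_idx}

    out = [None] * n
    for a, b in zip(key_idx, key_idx[1:]):
        out[a], out[b] = keyframes[a], keyframes[b]
        for i in range(a + 1, b):
            t = (i - a) / (b - a)
            out[i] = interpolate(keyframes[a], keyframes[b], t)
    return np.stack(out)

# 9-frame toy video, 16x16 RGB; outpaint a 4-pixel border.
video = np.random.rand(9, 16, 16, 3).astype(np.float32)
result = hierarchical_outpaint(video, pad=4, stride=4)
```

Because only keyframes are synthesized independently and every intermediate frame is anchored to two completed keyframes, errors cannot accumulate frame by frame as they would in purely sequential generation.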
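The 3D windowed attention used in the Enhanced Spatial-Temporal module restricts self-attention to local spatiotemporal windows. A minimal numpy sketch of the idea (window partitioning plus plain scaled dot-product attention, without the learned projections a real module would have; all shapes and window sizes are illustrative assumptions):

```python
import numpy as np

def window_partition_3d(x, wt, wh, ww):
    # x: (T, H, W, C) feature volume -> (num_windows, wt*wh*ww, C).
    # T, H, W are assumed divisible by the window sizes.
    T, H, W, C = x.shape
    x = x.reshape(T // wt, wt, H // wh, wh, W // ww, ww, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, wt * wh * ww, C)

def window_self_attention(tokens):
    # Scaled dot-product self-attention within each window
    # (no learned Q/K/V projections, for illustration only).
    C = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(C)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ tokens

# 4 frames of 8x8 features with 16 channels, 2x4x4 windows.
feats = np.random.rand(4, 8, 8, 16).astype(np.float32)
windows = window_partition_3d(feats, wt=2, wh=4, ww=4)
attended = window_self_attention(windows)
```

Each token thus attends jointly over nearby positions in space *and* time, which is how the module strengthens spatiotemporal interaction without the cost of full global attention.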