🤖 AI Summary
Existing video outpainting methods struggle to maintain intra- and inter-frame consistency in dynamic scenes and large-scale extrapolations because they model time only implicitly and see limited spatial context. This work presents the first unified framework that integrates the propagation and generation paradigms. It introduces a latent propagation mechanism that combines optical flow-based propagation with reference-guided synthesis, preserving the original visible content while producing spatiotemporally coherent, photorealistic outpainted results. A pre-trained optical flow completion network is incorporated and jointly optimized within an end-to-end fine-tuned, diffusion-based generative framework, which improves temporal consistency and generation reliability. Experiments show that the approach outperforms state-of-the-art methods in visual realism, temporal coherence, and inference efficiency, without requiring input-specific adaptation.
📝 Abstract
Video outpainting aims to expand the visible content of a video beyond the original frame boundaries while preserving spatial fidelity and temporal coherence across frames. Existing methods primarily rely on large-scale generative models, such as diffusion models. However, generation-based approaches suffer from implicit temporal modeling and limited spatial context. These limitations lead to intra-frame and inter-frame inconsistencies, which become particularly pronounced in dynamic scenes and large outpainting scenarios. To overcome these challenges, we propose Seen-to-Scene, a novel framework that unifies propagation-based and generation-based paradigms for video outpainting. Specifically, Seen-to-Scene leverages flow-based propagation with a flow completion network pre-trained for video inpainting, which is fine-tuned in an end-to-end manner to bridge the domain gap and reconstruct coherent motion fields. To further improve the efficiency and reliability of propagation, we introduce a reference-guided latent propagation that effectively propagates source content across frames. Extensive experiments demonstrate that our method achieves superior temporal coherence and visual realism with efficient inference, surpassing even prior state-of-the-art methods that require input-specific adaptation.
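To make the idea of flow-based propagation concrete, the following is a minimal sketch of the general technique, not the Seen-to-Scene implementation. It assumes a toy setting: frames are 2D grids of scalars with `None` marking unknown (to-be-outpainted) pixels, the flow is already complete, and sampling is nearest-neighbor. The function names `propagate_with_flow` and `fill_unknown` are hypothetical, and the abstract does not specify any of these details; in particular, the paper propagates in latent space rather than pixel space.

```python
from typing import List, Optional, Tuple

Grid = List[List[Optional[float]]]    # None marks an unknown pixel
Flow = List[List[Tuple[float, float]]]  # (dx, dy) from target pixel to source location


def propagate_with_flow(src: Grid, flow: Flow) -> Grid:
    """Backward-warp a source frame into the target frame's coordinates.

    For each target pixel (x, y), sample the source frame at
    (x + dx, y + dy) using nearest-neighbor rounding. Samples that fall
    outside the source frame, or land on an unknown source pixel, stay None.
    """
    h, w = len(src), len(src[0])
    out: Grid = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx, sy = round(x + dx), round(y + dy)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = src[sy][sx]
    return out


def fill_unknown(target: Grid, propagated: Grid) -> Grid:
    """Keep visible target pixels; fill unknown ones from the propagated frame."""
    return [
        [t if t is not None else p for t, p in zip(t_row, p_row)]
        for t_row, p_row in zip(target, propagated)
    ]


# Toy example: a 1x4 target frame with unknown pixels on both sides.
source = [[10.0, 11.0, 12.0, None]]
target = [[None, 20.0, 21.0, None]]
flow = [[(1.0, 0.0)] * 4]  # every target pixel looks one step right in the source

filled = fill_unknown(target, propagate_with_flow(source, flow))
```

In this toy example the left unknown pixel is filled from the neighboring frame, the two visible pixels are kept unchanged, and the right unknown pixel remains `None` because no source content maps to it. In a combined propagation-plus-generation pipeline, such leftover regions are the ones a generative model would be expected to synthesize.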