🤖 AI Summary
This work addresses spatial blurriness and temporal inconsistency in video outpainting, problems that become especially severe under challenging conditions such as limited camera motion or large outpainting regions, and traces them to a mismatch between the masking strategies used during training and inference. To resolve this, the authors propose a unified masking strategy that applies a mask of consistent direction and width across all video frames during training. Using this strategy, they fine-tune a pre-trained M3DDM model, eliminating the training-inference discrepancy. The approach substantially improves both the spatial sharpness and the temporal coherence of outpainted videos while maintaining computational efficiency, performing especially well in information-scarce scenarios where prior methods struggle.
📝 Abstract
M3DDM provides a computationally efficient framework for video outpainting via latent diffusion modeling. However, it exhibits significant quality degradation -- manifested as spatial blur and temporal inconsistency -- under challenging scenarios characterized by limited camera motion or large outpainting regions, where inter-frame information is limited. We identify the cause as a training-inference mismatch in the masking strategy: M3DDM's training applies random mask directions and widths across frames, whereas inference requires consistent directional outpainting throughout the video. To address this, we propose M3DDM+, which applies uniform mask direction and width across all frames during training, followed by fine-tuning of the pretrained M3DDM model. Experiments demonstrate that M3DDM+ substantially improves visual fidelity and temporal coherence in information-limited scenarios while maintaining computational efficiency. The code is available at https://github.com/tamaki-lab/M3DDM-Plus.
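The core of the proposed fix is simple to illustrate: instead of sampling a random mask direction and width for each frame (as in M3DDM's original training), a single direction and width are sampled once and applied to every frame, matching how inference masks the video. The sketch below is a minimal, hypothetical implementation of that idea; the function name, the set of directions, and the width-ratio range are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def unified_outpaint_mask(num_frames, height, width, rng=None):
    """Sketch of a unified masking strategy: sample ONE direction and
    ONE width, then apply the identical mask to all frames.
    Returns a (num_frames, height, width) array where 1 marks the
    region to be outpainted. Direction choices and the width-ratio
    range below are illustrative assumptions."""
    rng = rng or np.random.default_rng()
    direction = rng.choice(["left", "right", "top", "bottom"])
    ratio = rng.uniform(0.15, 0.5)  # hypothetical sampling range

    mask = np.zeros((num_frames, height, width), dtype=np.float32)
    if direction == "left":
        mask[:, :, : int(width * ratio)] = 1.0
    elif direction == "right":
        mask[:, :, width - int(width * ratio):] = 1.0
    elif direction == "top":
        mask[:, : int(height * ratio), :] = 1.0
    else:  # "bottom"
        mask[:, height - int(height * ratio):, :] = 1.0
    return mask
```

Because the same mask is broadcast across the time axis, the model never sees frames within one clip masked in conflicting directions during fine-tuning, which is the claimed source of the training-inference discrepancy.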