🤖 AI Summary
Existing video stabilization methods suffer from geometric distortions, excessive cropping, and poor generalization. This paper proposes a stabilization framework grounded in 3D scene reconstruction: it integrates Gaussian Splatting into stabilization, enabling test-time-adaptive local 3D scene representation; achieves temporally consistent jitter removal via dynamics-aware multi-view photometric supervision and cross-frame regularization; and incorporates a scene extrapolation module to mitigate boundary cropping. Evaluated on a repurposed dataset augmented with 3D-grounded information, the method is competitive with or superior to mainstream 2D and 2.5D approaches, with gains in conventional task metrics, improved geometric consistency, and markedly better perceptual quality as validated by a user study. The core contribution is the explicit coupling of 3D reconstruction with video stabilization, jointly preserving motion intent and spatiotemporal fidelity.
📝 Abstract
Video stabilization is pivotal for video processing, as it removes unwanted shakiness while preserving the user's original motion intent. Existing approaches, depending on the domain in which they operate, suffer from several issues (e.g., geometric distortions, excessive cropping, poor generalization) that degrade the user experience. To address these issues, we introduce GaVS, a novel 3D-grounded approach that reformulates video stabilization as a temporally consistent "local reconstruction and rendering" paradigm. Given 3D camera poses, we augment a reconstruction model to predict Gaussian Splatting primitives and finetune it at test time, with multi-view dynamics-aware photometric supervision and cross-frame regularization, to produce temporally consistent local reconstructions. The model is then used to render each stabilized frame. We utilize a scene extrapolation module to avoid frame cropping. Our method is evaluated on a repurposed dataset, instilled with 3D-grounded information, covering samples with diverse camera motions and scene dynamics. Quantitatively, our method is competitive with or superior to state-of-the-art 2D and 2.5D approaches in terms of conventional task metrics and a new geometry-consistency metric. Qualitatively, our method produces noticeably better results than alternatives, as validated by a user study.
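The test-time finetuning objective described above combines a per-frame photometric term with a cross-frame regularizer. The toy sketch below illustrates only that structure of the loss, not the actual GaVS pipeline: real frames, Gaussian Splatting primitives, and rendering are replaced by scalar per-frame values, and all names (`adapt`, `lam`) are hypothetical.

```python
import numpy as np

def adapt(frames, steps=200, lr=0.1, lam=0.5):
    """Toy test-time adaptation: gradient descent on a photometric term
    (match each frame's observation) plus a cross-frame regularizer
    (penalize differences between neighboring frames' parameters)."""
    # params is a scalar-per-frame stand-in for the per-frame reconstruction.
    params = np.zeros_like(frames, dtype=float)
    for _ in range(steps):
        # "Photometric" gradient: pull each frame's parameters toward its observation.
        g_photo = 2.0 * (params - frames)
        # Cross-frame regularizer gradient: encourage temporal consistency
        # by penalizing squared differences between adjacent frames.
        diff = np.diff(params)
        g_reg = np.zeros_like(params)
        g_reg[:-1] -= 2.0 * diff
        g_reg[1:] += 2.0 * diff
        params -= lr * (g_photo + lam * g_reg)
    return params

# Jittery per-frame observations get smoothed while staying close to the data.
observations = np.array([0.0, 1.0, 0.0, 1.0, 0.0])
stabilized = adapt(observations)
```

Increasing `lam` trades fidelity to each frame for stronger temporal consistency, mirroring the balance between preserving motion intent and removing jitter.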