🤖 AI Summary
Existing 4D perception methods rely on expensive and scarce 3D/4D ground-truth annotations of dynamic scenes, limiting their scalability. This work proposes SelfEvo, a framework that, for the first time, enables continual self-improvement of multi-view 4D perception models under fully unsupervised conditions. Through a self-distillation mechanism driven by spatiotemporal context asymmetry, SelfEvo leverages unlabeled video sequences to iteratively refine a pretrained reconstruction model without any external supervision. The framework is architecture-agnostic, showing strong compatibility with diverse backbones such as VGGT and π³. Evaluated across eight benchmarks, SelfEvo achieves substantial gains, improving video depth estimation by up to 36.5% and camera pose estimation by up to 20.1%.
📝 Abstract
Large-scale multi-view reconstruction models have made remarkable progress, but most existing approaches still rely on fully supervised training with ground-truth 3D/4D annotations. Such annotations are expensive and particularly scarce for dynamic scenes, limiting scalability. We propose SelfEvo, a framework that continually improves pretrained multi-view reconstruction models using unlabeled videos. SelfEvo introduces a self-distillation scheme based on spatiotemporal context asymmetry, enabling self-improvement for learning-based 4D perception without external annotations. We systematically study the design choices that make self-improvement effective, including loss signals, forms of asymmetry, and other training strategies. Across eight benchmarks spanning diverse datasets and domains, SelfEvo consistently improves pretrained baselines and generalizes across base models (e.g., VGGT and $π^3$), with significant gains on dynamic scenes. Overall, SelfEvo achieves up to 36.5% relative improvement in video depth estimation and 20.1% in camera pose estimation, without using any labeled data. Project Page: https://self-evo.github.io/.
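The abstract does not spell out SelfEvo's actual losses or architecture, but the core idea of self-distillation under spatiotemporal context asymmetry can be illustrated with a toy sketch: a teacher pass that sees the full frame sequence produces pseudo-labels, and a student pass restricted to a sub-window is optimized to match the teacher on the shared frames. Everything below is an assumption for illustration only — the `smooth` "teacher", the learnable correction `params["b"]`, and the MSE distillation loss are hypothetical stand-ins, not SelfEvo's method.

```python
import numpy as np


def smooth(frames, k=3):
    """Temporal moving average: a hypothetical stand-in for predictions
    made with richer spatiotemporal context (the 'teacher' view)."""
    kernel = np.ones(k) / k
    return np.convolve(frames, kernel, mode="same")


def self_distill_step(params, frames, window, lr=0.5):
    """One toy self-distillation step under context asymmetry.

    Teacher branch: sees the FULL frame sequence and emits per-frame
    pseudo-depths, treated as detached pseudo-labels (never updated).
    Student branch: sees only the sub-window `window` and carries a
    learnable correction params['b'], updated to match the teacher
    on the shared frames via a simple MSE distillation loss.
    """
    sub = frames[window]
    teacher = smooth(frames)[window]      # full-context pseudo-labels
    student = sub + params["b"]           # restricted-context prediction
    residual = student - teacher
    loss = float(np.mean(residual ** 2))
    # Gradient step on the student parameters only (teacher is frozen).
    params["b"] -= lr * float(np.mean(2.0 * residual))
    return loss
```

Iterating this step shrinks the student–teacher gap without any labels; the asymmetry (full sequence vs. sub-window) is what makes the teacher's pseudo-labels informative rather than trivially identical to the student's output.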