🤖 AI Summary
Existing self-supervised image encoders (e.g., DINO) rely solely on static images and therefore struggle to capture the spatiotemporal and geometric priors inherent in video, limiting their effectiveness on physics-aware perception tasks. To address this, we propose a video self-distillation framework that uses a single two-hour unlabeled video to train a static image encoder to predict the features of the next frame. This objective implicitly teaches temporal continuity and 3D spatial structure, without auxiliary modules such as optical flow estimation or object tracking, and remains fully compatible with standard image-based pretraining pipelines. To our knowledge, this is the first method to endow purely image-pretrained encoders with robust spatiotemporal and geometric awareness from such a small amount of video. On ADE20K semantic segmentation, our approach improves mIoU by 1.4 percentage points (35.0 → 36.4), indicating better modeling of physical-world structure.
📝 Abstract
Self-supervised image encoders such as DINO have recently gained significant interest for learning robust visual features without labels. However, most self-supervised learning (SSL) methods train on static images and miss the temporal cues inherent in videos. We introduce a video-distilled single-image encoder trained to predict the next-frame representation from the current frame. This simple objective injects 3D spatial and temporal priors without optical flow or tracking. When pre-trained on a single 2-hour video, our approach raises the mean Intersection-over-Union (mIoU) on ADE20K from 35.0 (DoRA) to 36.4 while remaining a drop-in replacement for image-only pipelines. Our results highlight video self-distillation as a lightweight route to geometry-aware perception, an essential ingredient for physically plausible world models and Physical AI.
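To make the training objective concrete, here is a minimal NumPy sketch of DINO-style next-frame self-distillation: a student encoder sees frame *t*, a momentum (EMA) teacher encodes frame *t+1*, and the loss pulls the student's prediction toward the teacher's target. The toy linear `encode` function, the 16-dimensional "frames", and all hyperparameter values are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frame, W):
    # Toy linear "encoder" standing in for a ViT backbone (hypothetical).
    return np.tanh(frame @ W)

def cosine_loss(pred, target):
    # 1 - cosine similarity, averaged over the batch.
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(pred * target, axis=-1)))

# Two consecutive "frames": frame t+1 is frame t plus small motion.
frame_t  = rng.normal(size=(4, 16))
frame_t1 = frame_t + 0.05 * rng.normal(size=(4, 16))

D = 8
W_student = 0.1 * rng.normal(size=(16, D))
W_teacher = W_student.copy()  # teacher initialized as a copy of the student

# Student predicts next-frame features from the current frame;
# the teacher encodes the actual next frame (stop-gradient in practice).
pred   = encode(frame_t,  W_student)
target = encode(frame_t1, W_teacher)
loss = cosine_loss(pred, target)

# EMA teacher update, as in DINO-style self-distillation.
momentum = 0.996
W_teacher = momentum * W_teacher + (1.0 - momentum) * W_student
```

In a real pipeline the student would be updated by backpropagating `loss`, while the teacher only ever changes through the EMA step, which is what keeps the targets stable.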