🤖 AI Summary
Existing self-supervised image encoders (e.g., DINO) rely solely on static images and therefore struggle to capture the spatiotemporal and geometric priors inherent in video, limiting their effectiveness on physics-aware perception tasks. To address this, we propose a video self-distillation framework that uses a single two-hour unlabeled video to train a static image encoder to predict the features of the next frame. This objective implicitly teaches temporal continuity and 3D spatial structure, without auxiliary modules such as optical flow estimation or object tracking, and remains fully compatible with standard image-based pretraining pipelines. To our knowledge, this is the first method to endow purely image-pretrained encoders with robust spatiotemporal and geometric awareness from such a small amount of video. On ADE20K semantic segmentation, our approach improves mIoU by 1.4 percentage points (35.0 → 36.4), indicating better modeling of physical-world structure.
📝 Abstract
Self-supervised image encoders such as DINO have recently gained significant interest for learning robust visual features without labels. However, most self-supervised learning (SSL) methods train on static images and miss the temporal cues inherent in videos. We introduce a video-distilled single-image encoder trained to predict the next-frame representation from the current frame. This simple objective injects 3D spatial and temporal priors without optical flow or tracking. When pre-trained on a single 2-hour video, our approach raises the mean Intersection-over-Union (mIoU) on ADE20K from 35.0 (DoRA) to 36.4 while remaining a drop-in replacement for image-only pipelines. Our results highlight video self-distillation as a lightweight route to geometry-aware perception, an essential ingredient for physically plausible world models and Physical AI.
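To make the training objective concrete, here is a minimal NumPy sketch of DINO-style next-frame self-distillation: a student encoder sees frame *t*, a momentum (EMA) teacher encodes frame *t+1*, and the loss pulls the student's prediction toward the teacher's target. The toy linear `encode` function, the 16-dimensional "frames", and all hyperparameter values are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frame, W):
    # Toy linear "encoder" standing in for a ViT backbone (hypothetical).
    return np.tanh(frame @ W)

def cosine_loss(pred, target):
    # 1 - cosine similarity, averaged over the batch.
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(pred * target, axis=-1)))

# Two consecutive "frames": frame t+1 is frame t plus small motion.
frame_t  = rng.normal(size=(4, 16))
frame_t1 = frame_t + 0.05 * rng.normal(size=(4, 16))

D = 8
W_student = 0.1 * rng.normal(size=(16, D))
W_teacher = W_student.copy()  # teacher initialized as a copy of the student

# Student predicts next-frame features from the current frame;
# the teacher encodes the actual next frame (stop-gradient in practice).
pred   = encode(frame_t,  W_student)
target = encode(frame_t1, W_teacher)
loss = cosine_loss(pred, target)

# EMA teacher update, as in DINO-style self-distillation.
momentum = 0.996
W_teacher = momentum * W_teacher + (1.0 - momentum) * W_student
```

In a real pipeline the student would be updated by backpropagating `loss`, while the teacher only ever changes through the EMA step, which is what keeps the targets stable.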