🤖 AI Summary
Existing video encoders face bottlenecks in dense prediction tasks (e.g., object tracking, semantic segmentation): image encoders (e.g., DINO, CLIP) lack temporal modeling capability, while video models (e.g., VideoMAE) yield coarse spatial representations unsuitable for pixel-level alignment. To address this, we propose FRAME, a self-supervised frame encoder tailored for dense video understanding. FRAME takes past and current RGB frames as input and jointly predicts DINO patch features for the current and future frames. It employs CLIP-based semantic-space alignment and a lightweight temporal memory mechanism to produce token-level representations that are spatiotemporally consistent, spatially precise, and language-aligned. FRAME outperforms DINO, CLIP, and VideoMAE across seven datasets and six dense prediction tasks. Moreover, it generalizes effectively to language-driven tasks such as video classification, combining a compact architecture with strong transferability.
📄 Abstract
Dense video prediction tasks, such as object tracking and semantic segmentation, require video encoders that generate temporally consistent, spatially dense features for every frame. However, existing approaches fall short: image encoders like DINO or CLIP lack temporal awareness, while video models such as VideoMAE underperform compared to image encoders on dense prediction tasks. We address this gap with FRAME, a self-supervised video frame encoder tailored for dense video understanding. FRAME learns to predict current and future DINO patch features from past and present RGB frames, leading to spatially precise and temporally coherent representations. To our knowledge, FRAME is the first video encoder to leverage image-based models for dense prediction while outperforming them on tasks requiring fine-grained visual correspondence. As an auxiliary capability, FRAME aligns its class token with CLIP's semantic space, supporting language-driven tasks such as video classification. We evaluate FRAME across six dense prediction tasks on seven datasets, where it consistently outperforms image encoders and existing self-supervised video models. Despite its versatility, FRAME maintains a compact architecture suitable for a range of downstream applications.
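The training objective described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the feature extractors are replaced by random stand-in arrays, and the exact loss terms, weights, and shapes (patch count `N`, feature dims `D_dino`, `D_clip`) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: N patch tokens; DINO and CLIP feature dimensions.
N, D_dino, D_clip = 196, 384, 512

# Frozen teacher targets (stand-ins for real DINO/CLIP outputs).
dino_current = rng.standard_normal((N, D_dino))  # DINO patch features, frame t
dino_future = rng.standard_normal((N, D_dino))   # DINO patch features, frame t+k
clip_class = rng.standard_normal(D_clip)         # CLIP class embedding, frame t

# Student predictions from the frame encoder (stand-ins: targets plus noise).
pred_current = dino_current + 0.1 * rng.standard_normal((N, D_dino))
pred_future = dino_future + 0.1 * rng.standard_normal((N, D_dino))
pred_class = clip_class + 0.1 * rng.standard_normal(D_clip)

def mse(a, b):
    """Mean squared error between predicted and target patch features."""
    return float(np.mean((a - b) ** 2))

def cosine_align(a, b):
    """1 - cosine similarity, a common alignment loss for CLIP-space tokens."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Combined objective: dense distillation of the current frame, prediction of
# the future frame's features, and semantic alignment of the class token.
loss = (mse(pred_current, dino_current)
        + mse(pred_future, dino_future)
        + cosine_align(pred_class, clip_class))
print(f"total loss: {loss:.4f}")
```

In practice each term would be weighted and the targets produced by frozen DINO/CLIP encoders; the sketch only shows how the three supervision signals combine into one scalar loss.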