Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work proposes a lightweight training paradigm for video foundation models that avoids computationally expensive, large-scale video pre-training. Instead of training end-to-end on massive video datasets, the approach freezes a pre-trained image foundation model as a spatial encoder and trains only a recurrent temporal module to capture dynamics across streaming video frames. Reusing the strong spatial representations of existing image models could substantially reduce the video data and compute required. Empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, motivating future work on recurrent video foundation models.

📝 Abstract

Video foundation models achieve strong performance across many video understanding tasks, but typically require large-scale pre-training on massive video datasets, resulting in substantial data and compute costs. In contrast, modern image foundation models already provide powerful spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take initial steps toward exploring a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a recurrent temporal module to process streaming video. By reusing an image foundation model as a spatial encoder, this approach could significantly reduce the amount of video data and compute required compared to end-to-end video pre-training. In this work, we explore the feasibility of this approach before investing compute in video pre-training. Our empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, motivating future work on recurrent video foundation models obtained by pre-training a temporal module on top of a frozen image foundation model. Code: https://github.com/tue-mps/towards-video-image-frozen
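The paradigm described in the abstract, a frozen spatial encoder feeding a trainable recurrent temporal module that consumes frames one at a time, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation (see the linked repository for that). The class name `FrozenEncoderRecurrentVideoModel`, the tiny convolutional stand-in for the image foundation model, the choice of `nn.GRUCell` as the recurrent module, and the placeholder loss are all assumptions made for illustration only.

```python
import torch
import torch.nn as nn


class FrozenEncoderRecurrentVideoModel(nn.Module):
    """Hypothetical sketch: frozen per-frame encoder + trainable recurrent module."""

    def __init__(self, image_encoder: nn.Module, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.image_encoder = image_encoder
        # Freeze the spatial encoder: its weights receive no gradient updates.
        for p in self.image_encoder.parameters():
            p.requires_grad_(False)
        self.image_encoder.eval()
        # Only this recurrent temporal module is trained (GRU is an assumption).
        self.temporal = nn.GRUCell(feat_dim, hidden_dim)

    def train(self, mode: bool = True):
        # Keep the frozen encoder in eval mode even when the model is training.
        super().train(mode)
        self.image_encoder.eval()
        return self

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, channels, height, width)
        b, t = video.shape[:2]
        h = video.new_zeros(b, self.temporal.hidden_size)
        states = []
        for i in range(t):  # streaming: one frame at a time
            with torch.no_grad():  # no activations stored for the frozen encoder
                feat = self.image_encoder(video[:, i])
            h = self.temporal(feat, h)
            states.append(h)
        return torch.stack(states, dim=1)  # (batch, time, hidden_dim)


torch.manual_seed(0)

# Stand-in for a pre-trained image foundation model (assumption: any module
# mapping a batch of frames to a (batch, feat_dim) feature tensor works here).
encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

model = FrozenEncoderRecurrentVideoModel(encoder, feat_dim=16, hidden_dim=32)
model.train()

video = torch.randn(2, 5, 3, 32, 32)  # 2 clips, 5 frames each
states = model(video)

# The optimizer sees only the temporal module's parameters.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

loss = states.pow(2).mean()  # placeholder objective, not the paper's loss
loss.backward()
optimizer.step()
```

Because the encoder runs under `torch.no_grad()` and its parameters have `requires_grad=False`, memory and compute for the backward pass scale with the temporal module alone, which is the source of the savings the abstract hypothesizes.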

Problem

Research questions and friction points this paper is trying to address.

video foundation models

data-efficient pre-training

frozen image models

temporal reasoning

video understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

video pre-training

frozen image foundation model

temporal reasoning

data efficiency

recurrent temporal module

🔎 Similar Papers

VideoPrism: A Foundational Visual Encoder for Video Understanding

2024-02-20 · International Conference on Machine Learning · Citations: 30