Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models

📅 2026-05-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work explores a lightweight training paradigm for video foundation models that avoids computationally expensive, large-scale video pre-training. Instead of training end-to-end on massive video datasets, the approach freezes a pre-trained image foundation model as a spatial encoder and trains only a recurrent temporal module to capture dynamics across streaming video frames. Because the spatial representations are reused rather than relearned, this strategy could substantially reduce the video data and compute required. It is presented as a feasibility study carried out before committing compute to video pre-training. Empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, which motivates future work on recurrent video foundation models built on top of a frozen image foundation model.
📝 Abstract
Video foundation models achieve strong performance across many video understanding tasks, but typically require large-scale pre-training on massive video datasets, resulting in substantial data and compute costs. In contrast, modern image foundation models already provide powerful spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take initial steps toward exploring a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a recurrent temporal module to process streaming video. By reusing an image foundation model as a spatial encoder, this approach could significantly reduce the amount of video data and compute required compared to end-to-end video pre-training. In this work, we explore the feasibility of this approach before investing in computing for video pre-training. Our empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, motivating future work on recurrent video foundation models obtained by pre-training a temporal module on top of a frozen image foundation model. Code: https://github.com/tue-mps/towards-video-image-frozen .
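The paradigm described in the abstract can be illustrated with a minimal sketch, using only the Python standard library. It is not the authors' implementation: the abstract does not specify the recurrent cell, so an Elman-style update is assumed here, the "image foundation model" is a hypothetical fixed random projection, and all class and function names are invented for illustration. Training is omitted; the sketch only shows which parameters would be trainable and how frames are consumed one at a time.

```python
import math
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Build a rows x cols matrix of small Gaussian values."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class FrozenImageEncoder:
    """Hypothetical stand-in for a pre-trained image model; never updated."""

    def __init__(self, pixel_dim, feat_dim, rng):
        self.weights = make_matrix(feat_dim, pixel_dim, rng)

    def encode(self, frame):
        # Map one flattened frame to a spatial feature vector.
        return [math.tanh(a) for a in matvec(self.weights, frame)]

    def num_trainable(self):
        # Frozen: no parameters are exposed to the optimizer.
        return 0


class RecurrentTemporalModule:
    """Assumed Elman-style cell: h_t = tanh(W_x f_t + W_h h_{t-1})."""

    def __init__(self, feat_dim, state_dim, rng):
        self.w_x = make_matrix(state_dim, feat_dim, rng)
        self.w_h = make_matrix(state_dim, state_dim, rng)
        self.state_dim = state_dim

    def initial_state(self):
        return [0.0] * self.state_dim

    def step(self, state, feat):
        # Combine the current frame feature with the previous state.
        from_feat = matvec(self.w_x, feat)
        from_state = matvec(self.w_h, state)
        return [math.tanh(a + b) for a, b in zip(from_feat, from_state)]

    def num_trainable(self):
        # Only these weights would be updated during video pre-training.
        return self.state_dim * (len(self.w_x[0]) + self.state_dim)


def run_stream(encoder, temporal, frames):
    """Process frames one at a time, keeping only a fixed-size state."""
    state = temporal.initial_state()
    for frame in frames:
        state = temporal.step(state, encoder.encode(frame))
    return state


rng = random.Random(0)
encoder = FrozenImageEncoder(pixel_dim=16, feat_dim=8, rng=rng)
temporal = RecurrentTemporalModule(feat_dim=8, state_dim=4, rng=rng)
video = [[rng.random() for _ in range(16)] for _ in range(5)]  # 5 toy frames
final_state = run_stream(encoder, temporal, video)
```

In this toy setup the encoder contributes no trainable parameters, while the temporal module contributes 48 (4 x 8 input weights plus 4 x 4 recurrent weights). The streaming loop holds a single state vector regardless of video length, which is the property that makes a recurrent module a natural fit for streaming video.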
Problem

Research questions and friction points this paper is trying to address.

video foundation models
data-efficient pre-training
frozen image models
temporal reasoning
video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

video pre-training
frozen image foundation model
temporal reasoning
data efficiency
recurrent temporal module