AI Summary
This work explores a lightweight training paradigm for video foundation models that avoids computationally expensive, large-scale video pretraining. Instead of training end-to-end on massive video datasets, the approach freezes a pretrained image foundation model as a spatial encoder and trains only a recurrent temporal module to capture dynamics across streaming video frames. Because the spatial representations come from an existing image model, this strategy could substantially reduce the video data and compute required. Empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pretraining, which motivates future work on recurrent video foundation models built on a frozen image foundation model.
Abstract
Video foundation models achieve strong performance across many video understanding tasks, but typically require large-scale pre-training on massive video datasets, resulting in substantial data and compute costs. In contrast, modern image foundation models already provide powerful spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take initial steps toward exploring a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a recurrent temporal module to process streaming video. By reusing an image foundation model as a spatial encoder, this approach could significantly reduce the amount of video data and compute required compared to end-to-end video pre-training. In this work, we explore the feasibility of this approach before investing compute in video pre-training. Our empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, motivating future work on recurrent video foundation models obtained by pre-training a temporal module on top of a frozen image foundation model. Code: https://github.com/tue-mps/towards-video-image-frozen .
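To make the data flow described in the abstract concrete, the following is a minimal sketch of a frozen spatial encoder feeding a recurrent temporal module one frame at a time. It is not the authors' implementation. The class names `FrozenImageEncoder` and `RecurrentTemporalModule`, the function `stream_video`, the plain Elman-style update, the flattened-list frames, and all dimensions are assumptions chosen for illustration; the abstract does not specify the recurrent architecture, and a real system would use a pretrained image model and a deep-learning framework.

```python
import math
import random


class FrozenImageEncoder:
    """Hypothetical stand-in for a pretrained image foundation model.

    The weights are stored as tuples and never modified, mirroring the
    'frozen' spatial encoder in the abstract.
    """

    def __init__(self, in_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        self.weights = tuple(
            tuple(rng.uniform(-0.1, 0.1) for _ in range(in_dim))
            for _ in range(feat_dim)
        )

    def __call__(self, frame):
        # One linear projection per frame; no temporal context is used here.
        return [sum(w * x for w, x in zip(row, frame)) for row in self.weights]


class RecurrentTemporalModule:
    """Assumed Elman-style cell: h_t = tanh(W_x f_t + W_h h_{t-1}).

    In the paradigm described above, only these parameters would be trained.
    """

    def __init__(self, feat_dim, hidden_dim, seed=1):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.w_x = [[rng.uniform(-0.5, 0.5) for _ in range(feat_dim)]
                    for _ in range(hidden_dim)]
        self.w_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
                    for _ in range(hidden_dim)]

    def initial_state(self):
        return [0.0] * self.hidden_dim

    def step(self, feature, state):
        # Combine the current frame feature with the previous hidden state.
        return [
            math.tanh(
                sum(w * f for w, f in zip(wx_row, feature))
                + sum(w * h for w, h in zip(wh_row, state))
            )
            for wx_row, wh_row in zip(self.w_x, self.w_h)
        ]


def stream_video(frames, encoder, temporal):
    """Process frames in arrival order, carrying only the hidden state forward."""
    state = temporal.initial_state()
    states = []
    for frame in frames:
        feature = encoder(frame)               # frozen spatial features
        state = temporal.step(feature, state)  # recurrent temporal update
        states.append(state)
    return states


# Usage: five random "frames" of 16 values each (illustrative data only).
rng = random.Random(42)
frames = [[rng.random() for _ in range(16)] for _ in range(5)]
encoder = FrozenImageEncoder(in_dim=16, feat_dim=8)
temporal = RecurrentTemporalModule(feat_dim=8, hidden_dim=4)
states = stream_video(frames, encoder, temporal)
print(len(states), len(states[0]))
```

Because the state at each step depends only on the current frame and the previous state, the output for a prefix of the video is identical to the first outputs for the full video. This is the property that lets such a model run on streaming input without revisiting earlier frames.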