From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning

πŸ“… 2026-03-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of transferring image pre-trained models to video tasks, where maintaining temporal consistency across frames and semantic discriminability across videos is difficult to achieve simultaneously. To this end, the authors propose the Co-Settle framework, which introduces a lightweight projection layer atop a frozen image encoder and jointly optimizes the representation space through a temporal cycle-consistency loss and a semantic separability constraint. Co-Settle explicitly models, and theoretically analyzes, the trade-off between these two objectives. Remarkably, it achieves significant performance gains across multiple video understanding benchmarks after only five epochs of self-supervised training. Consistent improvements are observed across eight widely used image pre-trained models, demonstrating the framework's efficiency and broad applicability.
πŸ“ Abstract
Recent studies have made notable progress in video representation learning by transferring image-pretrained models to video tasks, typically with complex temporal modules and video fine-tuning. However, fine-tuning heavy modules may compromise inter-video semantic separability, i.e., the essential ability to distinguish objects across videos, while reducing the tunable parameters hinders intra-video temporal consistency, which is required for stable representations of the same object within a video. This dilemma indicates a potential trade-off between intra-video temporal consistency and inter-video semantic separability during image-to-video transfer. To this end, we propose the Consistency-Separability Trade-off Transfer Learning (Co-Settle) framework, which applies a lightweight projection layer on top of the frozen image-pretrained encoder to adjust the representation space with a temporal cycle-consistency objective and a semantic separability constraint. We further provide theoretical support showing that the optimized projection yields a better trade-off between the two properties under appropriate conditions. Experiments on eight image-pretrained models demonstrate consistent improvements across multiple levels of video tasks with only five epochs of self-supervised training. The code is available at https://github.com/yafeng19/Co-Settle.
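To make the abstract's two competing objectives concrete, here is a minimal NumPy sketch, not the paper's implementation: the cycle-consistency term is approximated by adjacent-frame cosine similarity, the separability term penalizes similarity between distinct videos, and all function names, the linear projection `W`, and the weighting `lam` are illustrative assumptions.

```python
import numpy as np

def l2norm(x, axis=-1):
    # Normalize feature vectors so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def temporal_consistency_loss(frames):
    # frames: (T, D) projected features of one video.
    # Simplified proxy for cycle consistency: keep adjacent frames close
    # by minimizing (1 - cosine similarity) between consecutive frames.
    f = l2norm(frames)
    sims = np.sum(f[:-1] * f[1:], axis=-1)
    return float(np.mean(1.0 - sims))

def semantic_separability_loss(video_feats):
    # video_feats: (N, D) pooled features of N different videos.
    # Penalize high cosine similarity between distinct videos.
    v = l2norm(video_feats)
    sim = v @ v.T
    off_diag = sim[~np.eye(len(v), dtype=bool)]
    return float(np.mean(np.maximum(off_diag, 0.0)))

def joint_transfer_loss(W, frame_feats_per_video, lam=0.5):
    # W: (D, D) lightweight linear projection applied to frozen
    # image-encoder features; only W would be trained.
    projected = [f @ W for f in frame_feats_per_video]
    tc = float(np.mean([temporal_consistency_loss(p) for p in projected]))
    pooled = np.stack([p.mean(axis=0) for p in projected])
    sep = semantic_separability_loss(pooled)
    # lam balances intra-video consistency against inter-video separability.
    return tc + lam * sep
```

In this toy form the trade-off is visible directly: driving the consistency term to zero (e.g. a projection that collapses frames) tends to raise the separability penalty, so minimizing the weighted sum over `W` settles between the two.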
Problem

Research questions and friction points this paper is trying to address.

image-to-video transfer
temporal consistency
semantic separability
representation learning
self-supervised learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised learning
representation transfer
temporal consistency
semantic separability
lightweight projection