Image-to-Video Transfer Learning based on Image-Language Foundation Models: A Comprehensive Survey

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the large-scale data requirements and high computational cost of video-text learning by systematically investigating how image-language foundation models (ILFMs) can be transferred effectively to the video domain. We propose the first classification framework for image-to-video transfer learning, explicitly distinguishing between "frozen-feature" and "modified-feature" paradigms and covering tasks from fine-grained (e.g., spatio-temporal video grounding) to coarse-grained (e.g., video question answering). Through multi-task empirical analysis, we quantitatively evaluate the performance ceilings and generalization capabilities of diverse transfer strategies on downstream tasks. Our study establishes a structured roadmap for video-text learning, identifying key bottlenecks (insufficient temporal modeling and weak cross-modal alignment) and highlighting promising future directions such as lightweight adaptation and dynamic feature disentanglement.
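To make the "frozen-feature" paradigm concrete, here is a minimal PyTorch sketch (our own illustration, not code from the survey; `image_encoder` is a hypothetical stand-in for any pretrained ILFM image tower that maps a batch of frames to feature vectors). The encoder stays frozen and only a small temporal head is trained:

```python
import torch
import torch.nn as nn

class FrozenFeatureVideoModel(nn.Module):
    """Frozen-feature transfer: the pretrained ILFM image encoder is never
    updated; a lightweight temporal head learns video-level structure."""

    def __init__(self, image_encoder: nn.Module, feat_dim: int = 512):
        super().__init__()
        self.image_encoder = image_encoder
        for p in self.image_encoder.parameters():
            p.requires_grad = False  # preserve the original ILFM representations
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.temporal_head = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, C, H, W); encode each frame independently
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)              # (B*T, C, H, W)
        with torch.no_grad():
            feats = self.image_encoder(frames)    # (B*T, D), frozen features
        feats = feats.view(b, t, -1)              # (B, T, D)
        feats = self.temporal_head(feats)         # trainable temporal modeling
        return feats.mean(dim=1)                  # (B, D) video-level embedding
```

Only `temporal_head` receives gradients, which is the sense in which this paradigm sidesteps training a video-language foundation model from scratch.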

📝 Abstract
Image-Language Foundation Models (ILFMs) have demonstrated remarkable success in image-text understanding and generation tasks, providing transferable multimodal representations that generalize across diverse downstream image-based tasks. The advancement of video-text research has spurred growing interest in extending image-based models to the video domain. This paradigm, known as image-to-video transfer learning, alleviates the substantial data and computational requirements of training video-language foundation models from scratch. This survey provides the first comprehensive review of this emerging field, beginning with a summary of widely used ILFMs and their capabilities. We then systematically classify existing image-to-video transfer learning strategies into two categories, frozen features and modified features, depending on whether the original representations from the ILFM are preserved or modified. Building upon the task-specific nature of image-to-video transfer, this survey methodically elaborates on these strategies and details their applications across a spectrum of video-text learning tasks, ranging from fine-grained (e.g., spatio-temporal video grounding) to coarse-grained (e.g., video question answering). We further present a detailed experimental analysis of the efficacy of different image-to-video transfer learning paradigms on a range of downstream video understanding tasks. Finally, we identify prevailing challenges and highlight promising directions for future research. By offering a comprehensive and structured overview, this survey aims to establish a roadmap for advancing video-text learning on top of existing ILFMs and to inspire future research in this rapidly evolving domain.
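The abstract's second category, modified features, is typically realized by inserting small trainable modules into the pretrained encoder so that its representations are themselves adjusted. A hedged sketch of one common realization (a residual bottleneck adapter; the names and block structure are illustrative assumptions, not the survey's API):

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small trainable module inserted into each encoder block, so the
    ILFM features themselves are modified during transfer."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as identity: pretrained features pass through unchanged
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))  # residual bottleneck

def add_adapters(encoder_blocks: nn.ModuleList, dim: int) -> nn.ModuleList:
    """Wrap each pretrained block with an adapter; freeze everything else."""
    wrapped = nn.ModuleList()
    for block in encoder_blocks:
        for p in block.parameters():
            p.requires_grad = False  # original ILFM weights stay fixed
        wrapped.append(nn.Sequential(block, BottleneckAdapter(dim)))
    return wrapped
```

The zero-initialized up-projection means training starts from the unmodified ILFM and departs from it only as far as the downstream video task demands.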
Problem

Research questions and friction points this paper is trying to address.

Extending image-language models to video-text learning tasks
Reducing data and computation needs for video foundation models
Systematically classifying image-to-video transfer learning strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transfer learning from image-language models to the video domain
Using frozen or modified image-language features
Reducing data and computation for video tasks (see the parameter-count sketch below)
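The data and compute reduction these points refer to can be made concrete by counting trainable parameters; under the sketches above (our assumption, not a result reported by the survey), only the small added or appended modules require gradients:

```python
import torch

def count_trainable(model: torch.nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts, a rough proxy for
    adaptation cost under either transfer paradigm."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# For the frozen-feature sketch above, only the temporal head contributes
# to `trainable`, so trainable/total stays small relative to full fine-tuning.
```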