Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves

📅 2024-12-16
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address parameter redundancy and high computational overhead in the downstream adaptation of vision-language models (VLMs), particularly in prompt-tuning and adapter-based methods, this paper proposes Skip Tuning, a parameter-free paradigm that requires no additional context vectors or adapter modules. Instead, it introduces Layer-wise Skipping (LSkip) and Class-wise Skipping (CSkip) to strategically prune gradient propagation paths, reducing both the length and the width of the feature-gradient flow during full fine-tuning. The key insight is that freezing the backbone parameters is not the most effective route to efficient transfer; optimizing the structure of the gradient flow offers greater potential. Extensive experiments across multiple benchmarks demonstrate that Skip Tuning consistently outperforms state-of-the-art lightweight tuning approaches, reducing training memory by 37% and inference latency by 29% while maintaining or improving accuracy. This work achieves, for the first time, parameter-free, low-overhead, yet highly expressive VLM adaptation.

πŸ“ Abstract
Prompt tuning (PT) has long been recognized as an effective and efficient paradigm for transferring large pre-trained vision-language models (VLMs) to downstream tasks by learning a tiny set of context vectors. Nevertheless, in this work, we reveal that freezing the parameters of VLMs during learning the context vectors neither facilitates the transferability of pre-trained knowledge nor improves the memory and time efficiency significantly. Upon further investigation, we find that reducing both the length and width of the feature-gradient propagation flows of the full fine-tuning (FT) baseline is key to achieving effective and efficient knowledge transfer. Motivated by this, we propose Skip Tuning, a novel paradigm for adapting VLMs to downstream tasks. Unlike existing PT or adapter-based methods, Skip Tuning applies Layer-wise Skipping (LSkip) and Class-wise Skipping (CSkip) upon the FT baseline without introducing extra context vectors or adapter modules. Extensive experiments across a wide spectrum of benchmarks demonstrate the superior effectiveness and efficiency of our Skip Tuning over both PT and adapter-based methods. Code: https://github.com/Koorye/SkipTuning.
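The two skipping mechanisms can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see their repository for that); a toy MLP stands in for the image encoder, and all names, shapes, and the label choice are illustrative. LSkip shortens the gradient path by backpropagating only through the last few blocks, while CSkip narrows it by computing the loss over only a subset of classes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image encoder": a stack of tanh layers standing in for transformer blocks.
depth, dim, n_classes = 6, 8, 10
Ws = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(depth)]
W_cls = rng.normal(scale=0.1, size=(dim, n_classes))  # stand-in classifier head

x = rng.normal(size=(dim,))

# LSkip (illustrative): gradients flow only through the last `keep_layers`
# blocks; earlier blocks act as a frozen feature extractor, so the
# feature-gradient path is shorter and their activations need not be cached.
keep_layers = 2

h = x
acts = []  # cache inputs only for blocks that will receive gradients
for i, W in enumerate(Ws):
    if i >= depth - keep_layers:
        acts.append(h)
    h = np.tanh(h @ W)

logits = h @ W_cls

# CSkip (illustrative): restrict the loss, and hence the gradient width,
# to the top-k highest-scoring classes instead of all n_classes.
top_k = 3
keep = np.argsort(logits)[-top_k:]
sub_logits = logits[keep]
probs = np.exp(sub_logits - sub_logits.max())
probs /= probs.sum()
target = 0  # toy label: first kept class, purely for illustration
grad_logits_sub = probs.copy()
grad_logits_sub[target] -= 1.0  # softmax cross-entropy gradient

# Scatter back: gradient is exactly zero outside the kept classes.
grad_logits = np.zeros(n_classes)
grad_logits[keep] = grad_logits_sub

# Backpropagate only through the head and the last `keep_layers` blocks.
grad_h = W_cls @ grad_logits
grads = {}
for i in reversed(range(depth - keep_layers, depth)):
    h_in = acts[i - (depth - keep_layers)]
    pre = h_in @ Ws[i]
    grad_pre = grad_h * (1.0 - np.tanh(pre) ** 2)
    grads[i] = np.outer(h_in, grad_pre)  # weight gradient for block i
    grad_h = Ws[i] @ grad_pre

# Gradient propagation stops here: blocks 0..3 are skipped entirely.
print(sorted(grads))  # -> [4, 5]
```

The sketch shows the efficiency mechanism: memory scales with `keep_layers` (fewer cached activations) and the classification gradient touches only `top_k` of the `n_classes` logits, while the backbone itself remains fully trainable in the layers that do receive gradients.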
Problem

Research questions and friction points this paper is trying to address.

Prompt-tuning and adapter-based methods add extra parameters yet still incur notable memory and time costs.
Freezing VLM parameters neither facilitates the transfer of pre-trained knowledge nor significantly improves efficiency.
The long, wide feature-gradient propagation flows of full fine-tuning make adaptation expensive.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Skip Tuning adapts VLMs without extra context vectors or adapter modules.
Layer-wise Skipping (LSkip) shortens and Class-wise Skipping (CSkip) narrows the feature-gradient propagation flow.
Builds directly on the full fine-tuning baseline rather than freezing the backbone.
Authors
Shihan Wu, MS Student, University of Electronic Science and Technology of China (Computer Vision, Vision-Language Models, Robotics)
Ji Zhang, Southwest Jiaotong University
Pengpeng Zeng, Tongji University (computer vision)
Lianli Gao, UESTC (Vision and Language)
Jingkuan Song, Tongji University
Heng Tao Shen, University of Electronic Science and Technology of China; Tongji University