MERGETUNE: Continued fine-tuning of vision-language models

📅 2026-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes MERGETUNE, a novel continued fine-tuning (CFT) paradigm based on linear mode connectivity (LMC), to mitigate catastrophic forgetting of pre-trained knowledge in vision-language models such as CLIP. Without replaying pre-training data or altering the model architecture, MERGETUNE post-processes trainable parameters (such as soft prompts or linear heads) via a second-order surrogate constraint that implicitly fuses the zero-shot and fine-tuned solutions. This yields the first effective recovery of pre-trained knowledge after fine-tuning, consistently outperforming the original CLIP across multiple benchmarks. Notably, it improves the harmonic mean of CoOp by 5.6% on base-novel generalization and, for the first time, surpasses CLIP simultaneously on the DTD and EuroSAT cross-dataset transfer settings, while also exceeding ensemble baselines at lower inference cost.
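The "lower inference cost than ensembling" point can be made concrete with a toy contrast between output ensembling (two forward passes per input) and weight merging (one forward pass through averaged parameters). The setup below is purely illustrative: two linear heads stand in for the zero-shot and continued solutions, and the names `theta_zs` / `theta_c` are assumptions, not the paper's code.

```python
import numpy as np

def predict(theta, x):
    """One 'forward pass' of a linear head."""
    return x @ theta

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
theta_zs = rng.standard_normal((16, 4))  # zero-shot solution (stand-in)
theta_c = rng.standard_normal((16, 4))   # continued solution (stand-in)

# Ensemble: run BOTH models at inference, then average the outputs.
ens = 0.5 * predict(theta_zs, x) + 0.5 * predict(theta_c, x)

# Merge: average the weights ONCE, then a single forward pass at inference.
merged = predict(0.5 * theta_zs + 0.5 * theta_c, x)
```

For a linear head the two coincide exactly, which is why merging halves the inference cost for free; for nonlinear models they generally differ, and linear mode connectivity is precisely the property that makes the merged point a low-loss solution.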

📝 Abstract
Fine-tuning vision-language models (VLMs) such as CLIP often leads to catastrophic forgetting of pretrained knowledge. Prior work primarily aims to mitigate forgetting during adaptation; however, forgetting often remains inevitable during this process. We introduce a novel paradigm, continued fine-tuning (CFT), which seeks to recover pretrained knowledge after a zero-shot model has already been adapted. We propose a simple, model-agnostic CFT strategy (named MERGETUNE) guided by linear mode connectivity (LMC), which can be applied post hoc to existing fine-tuned models without requiring architectural changes. Given a fine-tuned model, we continue fine-tuning its trainable parameters (e.g., soft prompts or linear heads) to search for a continued model which has two low-loss paths to the zero-shot (e.g., CLIP) and the fine-tuned (e.g., CoOp) solutions. By exploiting the geometry of the loss landscape, the continued model implicitly merges the two solutions, restoring pretrained knowledge lost in the fine-tuned counterpart. A challenge is that the vanilla LMC constraint requires data replay from the pretraining task. We approximate this constraint for the zero-shot model via a second-order surrogate, eliminating the need for large-scale data replay. Experiments show that MERGETUNE improves the harmonic mean of CoOp by +5.6% on base-novel generalisation without adding parameters. On robust fine-tuning evaluations, the LMC-merged model from MERGETUNE surpasses ensemble baselines with lower inference cost, achieving further gains and state-of-the-art results when ensembled with the zero-shot model. Our code is available at https://github.com/Surrey-UP-Lab/MERGETUNE.
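The objective described in the abstract can be sketched loosely as follows. This is a minimal illustration, not the paper's formulation: the single interpolation point, the loss weighting, and the diagonal-Fisher stand-in for the second-order surrogate are all assumptions made for the sketch. The key structure it shows is the two LMC paths: the path to the fine-tuned solution can be constrained directly with downstream data, while the path to the zero-shot solution must be approximated without pre-training data.

```python
import numpy as np

def softmax_xent(theta, x, y):
    """Stand-in downstream loss: softmax cross-entropy of a linear head."""
    logits = x @ theta
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(len(y)), y]).mean()

def cft_objective(theta_c, theta_zs, theta_ft, fisher_diag, x, y,
                  alpha=0.5, lam=1.0):
    # (1) Downstream loss at the continued solution itself.
    loss = softmax_xent(theta_c, x, y)
    # (2) LMC path to the fine-tuned solution: downstream data is available,
    #     so evaluate the loss at an interpolated point directly.
    theta_mid_ft = alpha * theta_c + (1 - alpha) * theta_ft
    loss += softmax_xent(theta_mid_ft, x, y)
    # (3) LMC path to the zero-shot solution: pre-training data is NOT
    #     available, so use a second-order surrogate -- here a quadratic
    #     penalty under a (diagonal) Fisher approximation, an assumption
    #     standing in for the paper's surrogate.
    theta_mid_zs = alpha * theta_c + (1 - alpha) * theta_zs
    delta = theta_mid_zs - theta_zs
    loss += lam * np.sum(fisher_diag * delta ** 2)
    return loss

# Tiny synthetic evaluation of the objective.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 16))      # 32 samples, 16-dim features
y = rng.integers(0, 4, size=32)        # 4 classes
theta_zs = rng.standard_normal((16, 4))
theta_ft = rng.standard_normal((16, 4))
fisher_diag = np.ones((16, 4))         # placeholder Fisher diagonal

loss_at_ft = cft_objective(theta_ft, theta_zs, theta_ft, fisher_diag, x, y)
```

In practice `theta_c` would be initialized from the fine-tuned parameters and optimized against this objective; note that when `theta_c` equals `theta_zs`, the surrogate term (3) vanishes by construction.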
Problem

Research questions and friction points this paper is trying to address.

catastrophic forgetting
vision-language models
fine-tuning
pretrained knowledge recovery
zero-shot adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

continued fine-tuning
linear mode connectivity
catastrophic forgetting
vision-language models
model merging