Rethinking Fine-Tuning: Unlocking Hidden Capabilities in Vision-Language Models

📅 2025-12-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To adapt vision-language models (VLMs) to downstream tasks while keeping pretrained weights frozen, this paper proposes Mask Fine-Tuning (MFT): a parameter-efficient paradigm that updates none of the model's original weights. Instead, it introduces learnable gating scores into the language module and image–text projector that dynamically mask existing weight connections, enabling structured subnetwork reconfiguration. This is the first work to apply mask fine-tuning to VLMs, showing that merely "rewiring" frozen backbone connections, without touching the visual encoder, suffices to unlock strong task-specific adaptability. MFT works across different language backbones and consistently outperforms LoRA variants and full fine-tuning on multiple VLM benchmarks, all while leaving the visual encoder entirely frozen. The method thus achieves superior performance with zero trainable parameters in the vision tower, offering a novel perspective on structural adaptation of frozen multimodal foundation models. Code is publicly available.

📝 Abstract
Explorations in fine-tuning Vision-Language Models (VLMs), such as Low-Rank Adaptation (LoRA) from Parameter Efficient Fine-Tuning (PEFT), have made impressive progress. However, most approaches rely on explicit weight updates, overlooking the extensive representational structures already encoded in pre-trained models that remain underutilized. Recent works have demonstrated that Mask Fine-Tuning (MFT) can be a powerful and efficient post-training paradigm for language models. Instead of updating weights, MFT assigns learnable gating scores to each weight, allowing the model to reorganize its internal subnetworks for downstream task adaptation. In this paper, we rethink fine-tuning for VLMs from a structural reparameterization perspective grounded in MFT. We apply MFT to the language and projector components of VLMs with different language backbones and compare against strong PEFT baselines. Experiments show that MFT consistently surpasses LoRA variants and even full fine-tuning, achieving high performance without altering the frozen backbone. Our findings reveal that effective adaptation can emerge not only from updating weights but also from reestablishing connections among the model's existing knowledge. Code available at: https://github.com/Ming-K9/MFT-VLM
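To make the core mechanism concrete, here is a minimal NumPy sketch of the gating idea: a frozen weight matrix is overlaid with learnable scores, and only the top-scoring fraction of connections survives in the forward pass. The top-k thresholding rule, the `keep_ratio` value, and all variable names here are illustrative assumptions; the paper's exact scoring and binarization scheme may differ.

```python
import numpy as np

def masked_forward(x, W, scores, keep_ratio=0.5):
    """Forward pass through a frozen weight matrix W, keeping only the
    connections whose gating scores rank in the top `keep_ratio`
    fraction; all other weights are masked to zero."""
    k = int(np.ceil(keep_ratio * scores.size))
    thresh = np.sort(scores.ravel())[-k]       # k-th largest score survives
    mask = (scores >= thresh).astype(W.dtype)  # binary subnetwork mask
    return x @ (W * mask).T, mask

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))        # frozen pretrained weights
scores = rng.standard_normal((4, 8))   # learnable gating scores
x = rng.standard_normal((2, 8))        # a batch of inputs

y, mask = masked_forward(x, W, scores, keep_ratio=0.5)
```

Adaptation then consists of learning `scores` (and hence `mask`) while `W` stays untouched, which is what distinguishes this from weight-update methods such as LoRA.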
Problem

Research questions and friction points this paper is trying to address.

Unlocking hidden capabilities in vision-language models
Reorganizing internal subnetworks for task adaptation
Achieving high performance without altering the frozen backbone
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mask Fine-Tuning assigns learnable gating scores to weights
Method reorganizes internal subnetworks without updating model weights
Approach reestablishes connections among existing knowledge in models
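The "reorganize without updating weights" point can be sketched as a training step in which gradients reach only the gating scores, routed through the binary mask with a straight-through estimator. This is a common trick for training binary masks; whether the paper uses exactly this estimator is an assumption, and the MSE loss, learning rate, and shapes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 8))       # frozen pretrained weights (never updated)
scores = rng.standard_normal((4, 8))  # the only trainable parameters
x = rng.standard_normal((2, 8))
target = rng.standard_normal((2, 4))
lr = 0.1

W_before = W.copy()
for _ in range(3):
    mask = (scores > 0).astype(W.dtype)   # binarize scores into a mask
    y = x @ (W * mask).T
    grad_y = 2 * (y - target) / y.size    # d(MSE)/dy
    grad_masked_W = grad_y.T @ x          # gradient w.r.t. the masked weights
    # Straight-through estimator: treat d(mask)/d(scores) as identity,
    # so the gradient w.r.t. the binary mask flows directly into scores.
    grad_scores = grad_masked_W * W
    scores -= lr * grad_scores            # only the scores move
```

After training, `W` is bit-identical to its pretrained state; the adaptation lives entirely in which connections the learned mask keeps.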
Mingyuan Zhang
College of Engineering, Northeastern University
Yue Bai
Northwestern University, Northeastern University
Multi-modal learning · Sparse network training · Mask learning
Yifan Wang
College of Engineering, Northeastern University
Yiyang Huang
College of Engineering, Northeastern University
Yun Fu
Khoury College of Computer Science, Northeastern University