Difference Vector Equalization for Robust Fine-tuning of Vision-Language Models

📅 2025-11-13
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper addresses the challenge of improving in-distribution (ID) performance when fine-tuning vision-language models (e.g., CLIP) without sacrificing out-of-distribution (OOD) and zero-shot generalization. The authors propose Difference Vector Equalization (DiVE), which preserves the geometric structure of the embedding space by constraining the difference vectors between pre-trained and fine-tuned embeddings to be equal across data samples. DiVE introduces two complementary losses: an average vector loss (AVL), which aligns each difference vector with their weighted average to preserve global structure, and a pairwise vector loss (PVL), which enforces consistent multimodal alignment to preserve local structure. Experiments show that DiVE achieves strong results across ID, OOD, and zero-shot metrics, improving ID performance while maintaining cross-distribution generalization.


๐Ÿ“ Abstract
Contrastive pre-trained vision-language models, such as CLIP, demonstrate strong generalization abilities in zero-shot classification by leveraging embeddings extracted from image and text encoders. This paper aims to robustly fine-tune these vision-language models on in-distribution (ID) data without compromising their generalization abilities in out-of-distribution (OOD) and zero-shot settings. Current robust fine-tuning methods tackle this challenge by reusing contrastive learning, which was used in pre-training, for fine-tuning. However, we found that these methods distort the geometric structure of the embeddings, which plays a crucial role in the generalization of vision-language models, resulting in limited OOD and zero-shot performance. To address this, we propose Difference Vector Equalization (DiVE), which preserves the geometric structure during fine-tuning. The idea behind DiVE is to constrain difference vectors, each of which is obtained by subtracting the embeddings extracted from the pre-trained and fine-tuned models for the same data sample. By constraining the difference vectors to be equal across various data samples, we effectively preserve the geometric structure. To this end, we introduce two losses: average vector loss (AVL) and pairwise vector loss (PVL). AVL preserves the geometric structure globally by constraining difference vectors to be equal to their weighted average. PVL preserves the geometric structure locally by ensuring a consistent multimodal alignment. Our experiments demonstrate that DiVE effectively preserves the geometric structure, achieving strong results across ID, OOD, and zero-shot metrics.
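The abstract describes the losses only at a high level; a minimal NumPy sketch of the core idea, assuming a simple squared-L2 penalty toward the weighted average difference vector for AVL and an image-text difference-vector consistency term for PVL (the paper's exact formulations may differ), might look like:

```python
import numpy as np

def difference_vectors(pre, fine):
    # d_i = fine-tuned embedding minus pre-trained embedding, per sample
    return fine - pre

def avl(pre, fine, weights=None):
    # Average Vector Loss (sketch): penalize each difference vector's
    # deviation from the weighted average difference vector, so all
    # samples shift by (approximately) the same vector -> global
    # geometric structure is preserved.
    d = difference_vectors(pre, fine)
    if weights is None:
        weights = np.full(len(d), 1.0 / len(d))
    mean_d = (weights[:, None] * d).sum(axis=0)
    return float(np.mean(np.sum((d - mean_d) ** 2, axis=1)))

def pvl(img_pre, img_fine, txt_pre, txt_fine):
    # Pairwise Vector Loss (sketch): keep the image-side and text-side
    # difference vectors of the same sample consistent, so the local
    # multimodal alignment between paired embeddings is preserved.
    d_img = difference_vectors(img_pre, img_fine)
    d_txt = difference_vectors(txt_pre, txt_fine)
    return float(np.mean(np.sum((d_img - d_txt) ** 2, axis=1)))
```

Under this sketch, a fine-tuning step that translates every embedding by the same vector incurs zero loss, while sample-dependent distortions of the embedding geometry are penalized.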
Problem

Research questions and friction points this paper is trying to address.

How to robustly fine-tune vision-language models on in-distribution data
How to preserve the geometric structure of embeddings during fine-tuning
How to maintain generalization in out-of-distribution and zero-shot settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

DiVE preserves geometric structure during fine-tuning
Constrains difference vectors to be equal across samples
Uses AVL and PVL losses for global and local preservation