Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data

📅 2023-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To improve vision-language models such as CLIP without introducing new data or retraining the full model, this paper proposes HELIP, a method that continues fine-tuning an existing model through dynamic hard sample mining and weighted parameter updates within a contrastive learning framework, using only the original training dataset. Its key innovation is relying exclusively on *intra-dataset* hard example mining, which removes the dependence on external data or architectural modifications. HELIP is plug-and-play compatible with both CLIP and SLIP pipelines and requires minimal code changes. On ImageNet, two training epochs raise the zero-shot accuracy of SLIP models by 3.05% to 10.1%, depending on the pre-training dataset (CC3M, CC12M, or YFCC15M). On fine-grained classification benchmarks, HELIP improves average zero-shot performance by 8.4% for CLIP and 18.6% for SLIP, and average linear probe performance by 9.5% and 3.0%, respectively, without additional supervision or model reinitialization.

📝 Abstract

Contrastive Language-Image Pre-training (CLIP) has become the standard for cross-modal image-text representation learning. Improving CLIP typically requires additional data and retraining with new loss functions, but these demands raise resource and time costs, limiting practical use. In this work, we introduce HELIP, a cost-effective strategy that improves CLIP models by exploiting challenging text-image pairs within existing datasets in continuous training. This eliminates the need for additional data or extensive retraining. Moreover, HELIP integrates effortlessly into current training pipelines with minimal code modifications, allowing for quick and seamless implementation. On comprehensive benchmarks, HELIP consistently boosts existing models. In particular, within just two epochs of training, it improves zero-shot classification accuracy on ImageNet for SLIP models pre-trained on CC3M, CC12M, and YFCC15M datasets by 3.05%, 4.47%, and 10.1%, respectively. In addition, on fine-grained classification datasets, HELIP improves the zero-shot performance of CLIP and SLIP by an average of 8.4% and 18.6%, and their linear probe performance by an average of 9.5% and 3.0%. The code is publicly available at: https://github.com/haonan3/HELIP-NACCL-2025.git.
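As a rough illustration of what mining challenging pairs inside an existing dataset can look like, the sketch below ranks, for each image-text pair, the other pairs by their similarity to it in embedding space and keeps the top `k` as its hard pairs. This is not HELIP's actual mining procedure, which the abstract does not spell out. The function names (`cosine`, `mine_hard_pairs`), the scoring rule (image-image plus text-text cosine similarity), and the toy embeddings are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_hard_pairs(image_embs, text_embs, k=1):
    """For each pair i, return the indices of the k other pairs closest to it.

    Assumed scoring rule (not taken from the paper): the sum of the
    image-image and text-text cosine similarities between pair i and pair j.
    """
    n = len(image_embs)
    hard = []
    for i in range(n):
        scores = []
        for j in range(n):
            if j == i:
                continue  # a pair is never its own hard pair
            score = cosine(image_embs[i], image_embs[j]) + cosine(text_embs[i], text_embs[j])
            scores.append((score, j))
        # Highest combined similarity first: these are the most confusable pairs.
        scores.sort(key=lambda item: item[0], reverse=True)
        hard.append([j for _, j in scores[:k]])
    return hard


# Toy embeddings: pairs 0 and 1 are near-duplicates, pair 2 is unrelated.
image_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
text_embs = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
hard_pairs = mine_hard_pairs(image_embs, text_embs, k=1)
print(hard_pairs)
```

In a continued-training loop, each sampled pair could then be batched together with its mined hard pairs so that the contrastive loss sees more confusable negatives. How HELIP itself uses the mined pairs is described in the paper and the linked repository, not here.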

Problem

Research questions and friction points this paper is trying to address.

Enhances CLIP without extra data

Refines hard text-image pairs

Improves zero-shot classification accuracy
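
For readers unfamiliar with the metric: in CLIP-style zero-shot classification, an image is assigned to the class whose text-prompt embedding has the highest cosine similarity to the image embedding, with no classifier trained on the target dataset. The sketch below shows that decision rule and the resulting accuracy on toy vectors. The helper names (`normalize`, `zero_shot_predict`, `zero_shot_accuracy`) and the hand-written embeddings are hypothetical stand-ins for the outputs of real image and text encoders.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_predict(image_emb, class_text_embs):
    """Return the class label whose text embedding is most similar to the image."""
    img = normalize(image_emb)
    best_label, best_score = None, -math.inf
    for label, text_emb in class_text_embs.items():
        txt = normalize(text_emb)
        score = sum(a * b for a, b in zip(img, txt))  # cosine similarity
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def zero_shot_accuracy(image_embs, labels, class_text_embs):
    """Fraction of images whose predicted class matches the true label."""
    correct = sum(
        zero_shot_predict(emb, class_text_embs) == label
        for emb, label in zip(image_embs, labels)
    )
    return correct / len(labels)


# Toy stand-ins for encoder outputs (one text embedding per class prompt).
class_text_embs = {"cat": [1.0, 0.1], "dog": [0.1, 1.0]}
image_embs = [[0.9, 0.2], [0.2, 0.8], [0.6, 0.5]]
labels = ["cat", "dog", "dog"]

accuracy = zero_shot_accuracy(image_embs, labels, class_text_embs)
print(f"zero-shot accuracy: {accuracy:.3f}")
```

The third toy image sits closer to the "cat" prompt than to its true "dog" label, which is the kind of confusable case that training on hard pairs aims to reduce.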

Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhances CLIP without extra data

Uses challenging text-image pairs

Integrates easily into existing pipelines
