HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models

📅 2025-07-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation imposed by low-quality image–text pairs on CLIP’s performance. We propose an LVLM-driven self-amplifying cyclic paradigm: leveraging a large vision-language model (LVLM) to generate multi-granular textual descriptions—including positive/negative and short/long variants—to construct the high-quality VLM-150M dataset (built upon DFN-Large). We introduce negative descriptions and short labels as novel supervisory signals, extending the contrastive learning objective to jointly optimize data quality and model training. The resulting framework achieves state-of-the-art performance on zero-shot classification, cross-modal retrieval, and fine-grained visual understanding. Notably, its retrieval accuracy surpasses that of a standard CLIP model trained on ten times more data, empirically validating the effectiveness of our closed-loop optimization—where models enhance data curation (“model-to-data”) and improved data strengthens model learning (“data-to-model”).
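A rough sketch of the refinement step described above: an LVLM is queried once per variant to produce the four complementary texts (long positive/negative descriptions, short positive/negative tags) for each image and its raw alt-text. The `query_lvlm` callable, the `RefinedSample` container, and the prompt wording here are illustrative placeholders, not part of the paper's released pipeline.

```python
from dataclasses import dataclass

@dataclass
class RefinedSample:
    """One refined image-text record with the four LVLM-generated text variants."""
    image_path: str
    raw_alt_text: str
    long_positive: str   # detailed caption consistent with the image
    long_negative: str   # detailed caption deliberately inconsistent with the image
    short_positive: str  # concise tags for what is present in the image
    short_negative: str  # concise tags for what is absent or mismatched

def refine_pair(image_path: str, raw_alt_text: str, query_lvlm) -> RefinedSample:
    """Query an LVLM (any image + prompt -> text callable) for the four text formulas."""
    prompts = {
        "long_positive": (
            "Describe this image in detail, using the alt-text as a hint "
            f"and correcting it where it is wrong: {raw_alt_text}"
        ),
        "long_negative": "Write a detailed caption that plausibly but incorrectly describes this image.",
        "short_positive": "List short tags for the objects and attributes visible in this image.",
        "short_negative": "List short tags for objects or attributes that are NOT in this image.",
    }
    texts = {name: query_lvlm(image_path, prompt) for name, prompt in prompts.items()}
    return RefinedSample(image_path, raw_alt_text, **texts)
```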

📝 Abstract
Large-scale but noisy image-text pair data have paved the way for the success of Contrastive Language-Image Pretraining (CLIP). As the foundation vision encoder, CLIP in turn serves as the cornerstone for most large vision-language models (LVLMs). This interdependence naturally raises an interesting question: Can we reciprocally leverage LVLMs to enhance the quality of image-text pair data, thereby opening the possibility of a self-reinforcing cycle for continuous improvement? In this work, we take a significant step toward this vision by introducing an LVLM-driven data refinement pipeline. Our framework leverages LVLMs to process images and their raw alt-text, generating four complementary textual formulas: long positive descriptions, long negative descriptions, short positive tags, and short negative tags. Applying this pipeline to the curated DFN-Large dataset yields VLM-150M, a refined dataset enriched with multi-grained annotations. Based on this dataset, we further propose a training paradigm that extends conventional contrastive learning by incorporating negative descriptions and short tags as additional supervisory signals. The resulting model, namely HQ-CLIP, demonstrates remarkable improvements across diverse benchmarks. Within a comparable training data scale, our approach achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding tasks. In retrieval benchmarks, HQ-CLIP even surpasses standard CLIP models trained on the DFN-2B dataset, which contains 10$\times$ more training data than ours. All code, data, and models are available at https://zxwei.site/hqclip.
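The extended training objective can be approximated as a CLIP-style InfoNCE loss with two additions: each LVLM-generated negative description acts as a hard negative for its own image, and the short positive tags provide a second positive text view. The PyTorch sketch below reflects that reading under the assumption of L2-normalized embeddings; the exact formulation and loss weighting in HQ-CLIP may differ.

```python
import torch
import torch.nn.functional as F

def extended_clip_loss(img_emb, pos_txt_emb, neg_txt_emb, tag_emb, temperature=0.07):
    """Illustrative objective: standard image-text InfoNCE, plus per-image hard
    negatives from negative descriptions and an extra positive view from short tags.
    All inputs are assumed L2-normalized, shape (batch, dim)."""
    b = img_emb.size(0)
    targets = torch.arange(b, device=img_emb.device)

    # Image-to-text: in-batch positive texts plus one hard negative description per image.
    logits_pos = img_emb @ pos_txt_emb.t() / temperature                             # (B, B)
    logits_hard = (img_emb * neg_txt_emb).sum(dim=-1, keepdim=True) / temperature    # (B, 1)
    loss_i2t = F.cross_entropy(torch.cat([logits_pos, logits_hard], dim=1), targets)

    # Text-to-image over the positive descriptions only.
    loss_t2i = F.cross_entropy(logits_pos.t(), targets)

    # Short tags as an additional positive text view (symmetric InfoNCE).
    logits_tag = img_emb @ tag_emb.t() / temperature
    loss_tag = 0.5 * (F.cross_entropy(logits_tag, targets) +
                      F.cross_entropy(logits_tag.t(), targets))

    return 0.5 * (loss_i2t + loss_t2i) + loss_tag
```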
Problem

Research questions and friction points this paper is trying to address.

Enhancing image-text data quality using LVLMs
Creating multi-grained annotations for refined datasets
Improving CLIP models with negative descriptions and tags
Innovation

Methods, ideas, or system contributions that make the work stand out.

LVLM-driven data refinement pipeline
Multi-grained annotations for enriched datasets
Extended contrastive learning with negative descriptions
Zhixiang Wei
University of Science and Technology of China, WeChat Vision, Tencent Inc.
Guangting Wang
University of Science and Technology of China
Computer vision
Xiaoxiao Ma
Oracle, Macquarie University
LLM, deep generative models, anomaly detection, graph neural networks
Ke Mei
Tencent WeChat
deep learning, computer vision
Huaian Chen
University of Science and Technology of China
Yi Jin
University of Science and Technology of China
Fengyun Rao
WeChat Vision, Tencent Inc.