🤖 AI Summary
This work addresses the limitation that low-quality image–text pairs impose on CLIP’s performance. We propose an LVLM-driven, self-reinforcing cyclic paradigm: a large vision-language model (LVLM) generates multi-grained textual descriptions (positive/negative and short/long variants) to construct the high-quality VLM-150M dataset, built upon DFN-Large. We introduce negative descriptions and short tags as novel supervisory signals, extending the contrastive learning objective so that data quality and model training are optimized jointly. The resulting framework achieves state-of-the-art performance on zero-shot classification, cross-modal retrieval, and fine-grained visual understanding. Notably, its retrieval accuracy surpasses that of a standard CLIP model trained on ten times more data, empirically validating the effectiveness of the closed-loop optimization in which models enhance data curation (“model-to-data”) and improved data strengthens model learning (“data-to-model”).
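To make the pipeline's output concrete, the sketch below shows one plausible per-image record holding the four LVLM-generated text variants. The class and field names are illustrative assumptions, not the released VLM-150M schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RefinedSample:
    """One VLM-150M-style record: the original image-text pair plus the four
    LVLM-generated text variants (all field names are illustrative)."""
    image_path: str
    raw_alt_text: str
    long_positive: str              # detailed description of the image content
    long_negative: str              # description of what the image does NOT show
    short_positive_tags: List[str]  # concise tags for concepts present
    short_negative_tags: List[str]  # concise tags for concepts absent
```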
📝 Abstract
Large-scale but noisy image-text pair data have paved the way for the success of Contrastive Language-Image Pretraining (CLIP). As the foundation vision encoder, CLIP in turn serves as the cornerstone for most large vision-language models (LVLMs). This interdependence naturally raises an interesting question: can we reciprocally leverage LVLMs to enhance the quality of image-text pair data, thereby opening the possibility of a self-reinforcing cycle of continuous improvement? In this work, we take a significant step toward this vision by introducing an LVLM-driven data refinement pipeline. Our framework leverages LVLMs to process images and their raw alt-text, generating four complementary textual outputs: long positive descriptions, long negative descriptions, short positive tags, and short negative tags. Applying this pipeline to the curated DFN-Large dataset yields VLM-150M, a refined dataset enriched with multi-grained annotations. Based on this dataset, we further propose a training paradigm that extends conventional contrastive learning by incorporating negative descriptions and short tags as additional supervisory signals. The resulting model, named HQ-CLIP, demonstrates remarkable improvements across diverse benchmarks. Within a comparable training-data scale, our approach achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding tasks. On retrieval benchmarks, HQ-CLIP even surpasses standard CLIP models trained on the DFN-2B dataset, which contains 10× more training data than ours. All code, data, and models are available at https://zxwei.site/hqclip.
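The abstract does not spell out the extended objective, so the following is a minimal PyTorch sketch of one way the standard CLIP loss could be combined with negative descriptions (as per-sample hard negatives) and short positive tags (as an extra, coarser positive view). The function name `hq_clip_loss`, the weights `w_neg` and `w_tag`, and the exact form of each term are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def hq_clip_loss(img_emb, pos_long_emb, neg_long_emb, pos_tag_emb,
                 temperature=0.07, w_neg=1.0, w_tag=1.0):
    """Sketch of a contrastive objective extended with negative descriptions
    and short positive tags. All embeddings are (B, D) and L2-normalized."""
    targets = torch.arange(img_emb.size(0), device=img_emb.device)

    # Standard symmetric image-text contrastive term (as in CLIP),
    # using the long positive descriptions as the text side.
    logits_pos = img_emb @ pos_long_emb.t() / temperature          # (B, B)
    loss_clip = 0.5 * (F.cross_entropy(logits_pos, targets) +
                       F.cross_entropy(logits_pos.t(), targets))

    # Negative descriptions act as hard negatives: each image should score
    # higher with its positive description than with its negative one.
    sim_pos = (img_emb * pos_long_emb).sum(-1) / temperature        # (B,)
    sim_neg = (img_emb * neg_long_emb).sum(-1) / temperature        # (B,)
    loss_neg = F.cross_entropy(torch.stack([sim_pos, sim_neg], dim=1),
                               torch.zeros_like(targets))

    # Short positive tags supply an additional coarse-grained positive pairing.
    logits_tag = img_emb @ pos_tag_emb.t() / temperature
    loss_tag = 0.5 * (F.cross_entropy(logits_tag, targets) +
                      F.cross_entropy(logits_tag.t(), targets))

    return loss_clip + w_neg * loss_neg + w_tag * loss_tag
```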