un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
CLIP exhibits limited capacity for fine-grained visual discrimination and dense prediction due to insufficient detail representation. To address this, we propose un$^2$CLIP—a novel framework that, for the first time, injects prior knowledge from the generative model unCLIP into the CLIP image encoder via reverse decomposition, significantly enhancing visual detail capture without compromising cross-modal alignment. Our method comprises three key components: (i) conditional generation modeling grounded in unCLIP, (ii) reverse fine-tuning of the CLIP image encoder, and (iii) cross-modal embedding space consistency constraints. Extensive experiments demonstrate that un$^2$CLIP consistently outperforms both the original CLIP and state-of-the-art variants across diverse benchmarks—including MMVP-VLM, open-vocabulary segmentation, and multimodal large language models—establishing a new paradigm for fine-grained vision-language understanding.

📝 Abstract
Contrastive Language-Image Pre-training (CLIP) has become a foundation model and has been applied to various vision and multimodal tasks. However, recent works indicate that CLIP falls short in distinguishing detailed differences in images and shows suboptimal performance on dense-prediction and vision-centric multimodal tasks. Therefore, this work focuses on improving existing CLIP models, aiming to capture as many visual details in images as possible. We find that a specific type of generative model, unCLIP, provides a suitable framework for achieving our goal. Specifically, unCLIP trains an image generator conditioned on the CLIP image embedding. In other words, it inverts the CLIP image encoder. Compared to discriminative models like CLIP, generative models are better at capturing image details because they are trained to learn the data distribution of images. Additionally, the conditional input space of unCLIP aligns with CLIP's original image-text embedding space. Therefore, we propose to invert unCLIP (dubbed un$^2$CLIP) to improve the CLIP model. In this way, the improved image encoder can gain unCLIP's visual detail capturing ability while preserving its alignment with the original text encoder. We evaluate our improved CLIP across various tasks to which CLIP has been applied, including the challenging MMVP-VLM benchmark, the dense-prediction open-vocabulary segmentation task, and multimodal large language model tasks. Experiments show that un$^2$CLIP significantly improves the original CLIP and previous CLIP improvement methods. Code and models will be available at https://github.com/LiYinqi/un2CLIP.
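The abstract sketches the core mechanism: fine-tune the CLIP image encoder so that a frozen unCLIP generator can reconstruct the image from its embedding, while keeping the embedding close to the original encoder's output so alignment with the text encoder survives. A toy numpy sketch of such an objective (hypothetical variable names, linear stand-ins for the encoder and generator, and an arbitrary weighting; not the authors' implementation):

```python
# Toy sketch of an un^2CLIP-style objective (hypothetical, not the paper's code):
# fine-tune an image encoder E so a FROZEN generator G can reconstruct the image
# from E(x), plus a consistency term keeping E(x) near the original encoder E0(x)
# so alignment with the CLIP text encoder is preserved.
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_EMB = 64, 16                     # toy image and embedding dimensions

W0 = rng.normal(size=(D_EMB, D_IMG)) / np.sqrt(D_IMG)  # frozen original encoder E0
W = W0.copy()                                          # trainable encoder E, init from E0
G = rng.normal(size=(D_IMG, D_EMB)) / np.sqrt(D_EMB)   # frozen generator G (unCLIP stand-in)

def losses(x, W):
    z = W @ x                             # image embedding E(x)
    recon = G @ z                         # generator reconstruction G(E(x))
    l_gen = np.mean((recon - x) ** 2)     # detail-capturing (generative) term
    l_align = np.mean((z - W0 @ x) ** 2)  # stay close to the original embedding space
    return l_gen, l_align

x = rng.normal(size=D_IMG)                # one toy "image"
l_gen, l_align = losses(x, W)
total = l_gen + 0.1 * l_align             # lambda = 0.1 is an arbitrary choice here
print(round(float(total), 4))
```

At initialization the consistency term is exactly zero (E starts as a copy of E0); training would trade a small drift in embedding space for a large gain in reconstructability, which is the paper's stated balance between detail capture and text alignment.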
Problem

Research questions and friction points this paper is trying to address.

CLIP struggles to distinguish fine-grained visual details in images
CLIP underperforms on dense-prediction and vision-centric multimodal tasks
Improving the image encoder must preserve alignment with the original text encoder
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inverts unCLIP to transfer its visual detail capturing ability to CLIP
Aligns generative and discriminative model embedding spaces
Improves CLIP on dense-prediction and multimodal tasks
👥 Authors

Yinqi Li
Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Jiahe Zhao
Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Hong Chang
Researcher at Institute of Computing Technology, Chinese Academy of Sciences
Machine Learning, Computer Vision, Pattern Recognition

Ruibing Hou
Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Deep Learning

Shiguang Shan
Professor at Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Machine Learning, Face Recognition

Xilin Chen
Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences