CLIP-IN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Despite their success, vision-language models like CLIP exhibit significant limitations in fine-grained visual understanding. To address this, we propose a CLIP enhancement framework specifically designed for fine-grained semantic discrimination. First, we construct high-quality hard negative image-text pairs using image manipulation–oriented instruction data and introduce a symmetric contrastive loss to strengthen discrimination of subtle visual distinctions. Second, we incorporate a long-text description modeling module equipped with rotary position encoding to better capture complex visual-semantic contextual dependencies. Evaluated on fine-grained benchmarks such as MMVP, our method achieves substantial performance gains while preserving zero-shot classification and cross-modal retrieval capabilities—and notably mitigates visual hallucination in multimodal large language models. The core innovations lie in (i) an instruction-driven hard negative mining mechanism and (ii) a long-context-aware text encoding strategy.
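The symmetric contrastive loss over hard negatives described above can be sketched in plain Python. This is a minimal illustration under assumptions, not the paper's exact objective: the function names, toy similarity values, and temperature are hypothetical. Each image is scored against its matched caption, the other in-batch captions, and an instruction-edited caption that serves as a hard negative; the text-to-image direction mirrors this, and the two directions are averaged:

```python
import math

def info_nce_with_hard_negative(pos_sim, neg_sims, hard_neg_sim, temperature=0.07):
    """InfoNCE-style loss where an instruction-edited caption acts as an
    extra hard negative. pos_sim: similarity of the matched pair;
    neg_sims: similarities to other in-batch items; hard_neg_sim:
    similarity to the edited (hard-negative) caption."""
    logits = [pos_sim] + list(neg_sims) + [hard_neg_sim]
    logits = [l / temperature for l in logits]
    m = max(logits)  # subtract max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

def symmetric_hard_negative_loss(i2t_loss, t2i_loss):
    """Average of image-to-text and text-to-image directions
    (a sketch of a symmetric contrastive objective)."""
    return 0.5 * (i2t_loss + t2i_loss)

# toy example: the matched pair scores highest, but the edited
# caption is deliberately close behind (a "hard" negative)
i2t = info_nce_with_hard_negative(0.9, [0.1, 0.2], 0.8)
t2i = info_nce_with_hard_negative(0.9, [0.15, 0.05], 0.75)
loss = symmetric_hard_negative_loss(i2t, t2i)
```

The closer the edited caption's similarity is to the positive's, the larger the loss term it contributes, which is what pushes the encoder to separate subtle visual-semantic differences.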

📝 Abstract
Despite the success of Vision-Language Models (VLMs) like CLIP in aligning vision and language, their proficiency in detailed, fine-grained visual comprehension remains a key challenge. We present CLIP-IN, a novel framework that bolsters CLIP's fine-grained perception through two core innovations. Firstly, we leverage instruction-editing datasets, originally designed for image manipulation, as a unique source of hard negative image-text pairs. Coupled with a symmetric hard negative contrastive loss, this enables the model to effectively distinguish subtle visual-semantic differences. Secondly, CLIP-IN incorporates long descriptive captions, utilizing rotary positional encodings to capture rich semantic context often missed by standard CLIP. Our experiments demonstrate that CLIP-IN achieves substantial gains on the MMVP benchmark and various fine-grained visual recognition tasks, without compromising robust zero-shot performance on broader classification and retrieval tasks. Critically, integrating CLIP-IN's visual representations into Multimodal Large Language Models significantly reduces visual hallucinations and enhances reasoning abilities. This work underscores the considerable potential of synergizing targeted, instruction-based contrastive learning with comprehensive descriptive information to elevate the fine-grained understanding of VLMs.
Problem

Research questions and friction points this paper is trying to address.

Enhancing fine-grained visual understanding in CLIP models
Distinguishing subtle visual-semantic differences effectively
Reducing visual hallucinations in multimodal language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruction-editing datasets for hard negative pairs
Symmetric hard negative contrastive loss training
Long captions with rotary positional encodings
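The third innovation, long captions encoded with rotary positional embeddings, can be illustrated with a minimal RoPE sketch in plain Python. This follows the standard rotary-embedding formulation, not the paper's specific implementation; the vectors and positions below are toy values:

```python
import math

def rope(vec, position, base=10000.0):
    """Apply rotary position encoding to an even-dimensional vector.
    Each consecutive pair (vec[2i], vec[2i+1]) is rotated by an angle
    that depends on the token position and the pair index, so dot
    products between rotated vectors depend only on relative position."""
    d = len(vec)
    assert d % 2 == 0, "RoPE expects an even-dimensional vector"
    out = []
    for i in range(0, d, 2):
        theta = position * (base ** (-i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# relative-position property: a query at position 5 attending to a key
# at position 3 scores the same as positions 7 and 5 (same offset of 2)
q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.4, 0.9, 0.1]
s1 = dot(rope(q, 5), rope(k, 3))
s2 = dot(rope(q, 7), rope(k, 5))
```

Because attention scores depend only on relative offsets rather than absolute indices, this style of encoding degrades more gracefully on long descriptive captions than learned absolute position embeddings, which is the motivation the summary attributes to this module.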