Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP

📅 2025-01-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
CLIP exhibits poor negation understanding—e.g., treating "no parking" as if it meant "parking"—due to severe underrepresentation of negative samples in its pretraining data. This work systematically identifies this data bias as the root cause of CLIP's negation deficiency and introduces NegRefCOCOg, a fine-grained benchmark for evaluating negation comprehension in vision-language grounding. To address the issue, the authors propose a pipeline that leverages large language models (LLMs) and multimodal large language models (MLLMs) to synthesize and filter high-quality negation-aware image-text pairs. Using these data, they perform lightweight contrastive fine-tuning of CLIP, yielding NegationCLIP. Experiments demonstrate that NegationCLIP achieves significant gains on negation understanding tasks while preserving CLIP's original generalization capability. Moreover, it consistently improves downstream performance in text-to-image generation and referring image segmentation, validating its broad applicability.
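The fine-tuning step described above uses CLIP's standard symmetric contrastive (InfoNCE) objective, only with batches that now include negation-inclusive captions, so that "a street with a parking sign" and "a street with no parking sign" pull toward different images. A minimal numpy sketch of that loss (the function name and toy embeddings are illustrative, not from the paper):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Row i of image_emb is assumed to match row i of text_emb; in
    negation-aware fine-tuning, some text rows are negation-inclusive
    captions paired with their images.
    """
    # L2-normalize so the dot product is cosine similarity, as in CLIP
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # (n, n) similarity matrix
    n = logits.shape[0]

    def cross_entropy(l):
        # Numerically stable softmax cross-entropy; targets are the diagonal
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly aligned pairs the loss is near zero; mismatching a caption (e.g., pairing an image with the negated description of a different image) drives it up, which is exactly the gradient signal that teaches the model to separate "parking" from "no parking".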

📝 Abstract
While CLIP has significantly advanced multimodal understanding by bridging vision and language, its inability to grasp negation - such as failing to differentiate concepts like "parking" from "no parking" - poses substantial challenges. By analyzing the data used in the public CLIP model's pre-training, we posit this limitation stems from a lack of negation-inclusive data. To address this, we introduce data generation pipelines that employ a large language model (LLM) and a multimodal LLM to produce negation-inclusive captions. Fine-tuning CLIP with data generated from our pipelines, we develop NegationCLIP, which enhances negation awareness while preserving generality. Moreover, to enable a comprehensive evaluation of negation understanding, we propose NegRefCOCOg - a benchmark tailored to test VLMs' ability to interpret negation across diverse expressions and positions within a sentence. Experiments on various CLIP architectures validate the effectiveness of our data generation pipelines in enhancing CLIP's ability to perceive negation accurately. Additionally, NegationCLIP's enhanced negation awareness has practical applications across various multimodal tasks, demonstrated by performance gains in text-to-image generation and referring image segmentation.
Problem

Research questions and friction points this paper is trying to address.

CLIP Model
Negation Understanding
Data Limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

NegationCLIP
Large Language Models
Negation Understanding