🤖 AI Summary
This study addresses a previously underexplored limitation of the CLIP model—its tendency to exhibit "center bias," wherein excessive attention is allocated to the center of images at the expense of peripheral regions, thereby impairing fine-grained visual understanding. The authors formally identify and name this phenomenon, tracing its origin through embedding decomposition and attention map analysis to the information loss induced by spatial pooling mechanisms. To mitigate this issue without retraining the model, they propose a training-free strategy that combines visual prompting with attention redistribution to redirect the model's focus toward non-central areas. Experimental results demonstrate that the proposed approach significantly improves CLIP's recognition of objects located near image boundaries while preserving the original architecture and parameters.
📝 Abstract
Recent research has shown that contrastive vision-language models such as CLIP often lack fine-grained understanding of visual content. While a growing body of work has sought to address this limitation, we identify a distinct failure mode in the CLIP family, which we term center bias, that persists even in recent model variants. Specifically, CLIP tends to disproportionately focus on the central region of an image, overlooking important objects located near the boundaries. This limitation is fundamental, as failure to recognize relevant objects makes it difficult to perform any sophisticated task that depends on those objects. To understand the underlying causes of this limitation, we conduct analyses from both representation and attention perspectives. Using interpretability methods, i.e., embedding decomposition and attention map analysis, we find that relevant concepts, especially those associated with off-center objects, vanish from the model's final representation due to information loss during the aggregation of visual embeddings, particularly the reliance on pooling mechanisms. Finally, we show that this bias can be alleviated with training-free strategies such as visual prompting and attention redistribution, which redirect the model's attention to off-center regions.
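To make the "attention redistribution" idea concrete, here is a minimal, hypothetical sketch of one way such a training-free intervention could look: damp the attention weights assigned to central patches of a ViT-style patch grid and renormalize, shifting mass toward the periphery. The grid size, center window, and damping factor below are illustrative assumptions, not the paper's actual procedure.

```python
# Hypothetical sketch of attention redistribution: downweight attention
# on central patches and renormalize so off-center regions gain weight.
# Grid size, center window, and damping factor are assumptions for
# illustration only, not the authors' exact method.

def redistribute_attention(attn, grid=7, center_frac=0.5, damp=0.5):
    """attn: flat list of grid*grid non-negative attention weights.

    Returns a renormalized distribution with central patches damped.
    """
    assert len(attn) == grid * grid
    lo = grid * (1.0 - center_frac) / 2.0   # start of central window
    hi = grid - lo                          # end of central window
    out = []
    for idx, w in enumerate(attn):
        row, col = divmod(idx, grid)
        # Damp weights whose patch falls inside the central window.
        if lo <= row < hi and lo <= col < hi:
            w *= damp
        out.append(w)
    total = sum(out)
    return [w / total for w in out]

# Starting from uniform attention, the center loses mass to the edges.
uniform = [1.0 / 49] * 49
shifted = redistribute_attention(uniform)
```

In practice such a reweighting would be applied to the attention scores of the final pooling step (e.g., the CLS token's attention over patch tokens) at inference time, leaving all model parameters untouched.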