🤖 AI Summary
This study addresses a previously underexplored limitation of the CLIP model—its tendency to exhibit "center bias," wherein excessive attention is allocated to the center of images at the expense of peripheral regions, thereby impairing fine-grained visual understanding. The authors formally identify and name this phenomenon, tracing its origin through embedding decomposition and attention map analysis to the information loss induced by spatial pooling mechanisms. To mitigate this issue without retraining the model, they propose a training-free strategy that combines visual prompting with attention redistribution to redirect the model's focus toward non-central areas. Experimental results demonstrate that the proposed approach significantly improves CLIP's recognition of objects located near image boundaries while preserving the original architecture and parameters.
📝 Abstract
Recent research has shown that contrastive vision-language models such as CLIP often lack fine-grained understanding of visual content. While a growing body of work has sought to address this limitation, we identify a distinct failure mode in the CLIP family, which we term center bias, that persists even in recent model variants. Specifically, CLIP tends to disproportionately focus on the central region of an image, overlooking important objects located near the boundaries. This limitation is fundamental, as failure to recognize relevant objects makes it difficult to perform any sophisticated task that depends on those objects. To understand the underlying causes of this limitation, we conduct analyses from both representation and attention perspectives. Using interpretability methods, i.e., embedding decomposition and attention map analysis, we find that relevant concepts, especially those associated with off-center objects, vanish from the model's final representation due to information loss during the aggregation of visual embeddings, particularly the reliance on pooling mechanisms. Finally, we show that this bias can be alleviated with training-free strategies such as visual prompting and attention redistribution, which redirect the model's attention to off-center regions.
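To make the "attention redistribution" idea concrete, here is a minimal, hypothetical sketch of one way such a training-free intervention could look: damp the attention weights assigned to central patches of a ViT-style patch grid and renormalize, shifting mass toward the periphery. The grid size, center window, and damping factor below are illustrative assumptions, not the paper's actual procedure.

```python
# Hypothetical sketch of attention redistribution: downweight attention
# on central patches and renormalize so off-center regions gain weight.
# Grid size, center window, and damping factor are assumptions for
# illustration only, not the authors' exact method.

def redistribute_attention(attn, grid=7, center_frac=0.5, damp=0.5):
    """attn: flat list of grid*grid non-negative attention weights.

    Returns a renormalized distribution with central patches damped.
    """
    assert len(attn) == grid * grid
    lo = grid * (1.0 - center_frac) / 2.0   # start of central window
    hi = grid - lo                          # end of central window
    out = []
    for idx, w in enumerate(attn):
        row, col = divmod(idx, grid)
        # Damp weights whose patch falls inside the central window.
        if lo <= row < hi and lo <= col < hi:
            w *= damp
        out.append(w)
    total = sum(out)
    return [w / total for w in out]

# Starting from uniform attention, the center loses mass to the edges.
uniform = [1.0 / 49] * 49
shifted = redistribute_attention(uniform)
```

In practice such a reweighting would be applied to the attention scores of the final pooling step (e.g., the CLS token's attention over patch tokens) at inference time, leaving all model parameters untouched.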