🤖 AI Summary
Vision-language models like CLIP excel at global semantic alignment but suffer from an inherent global bias, exhibiting limited capacity for fine-grained local feature modeling. To address this, we propose Multi-Crop Enhancement (MCE), a plug-and-play inference-time augmentation method that explicitly activates CLIP's perception of local visual regions and their cross-modal alignment. MCE randomly crops input images to restrict the receptive field and pairs each crop with corresponding fine-grained textual descriptions, requiring no architectural modification or model retraining. Extensive experiments demonstrate consistent and significant improvements across three key downstream tasks: zero-shot classification, few-shot transfer learning, and test-time adaptation. These results validate MCE's effectiveness, generalizability, and deployment efficiency, offering a novel and practical approach to mitigating the local semantic blind spot inherent in current vision-language models.
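The multi-crop idea described above can be sketched as follows. This is a minimal illustration, not the authors' released code: `random_crops`, `mce_logits`, the crop-scale range, and the simple view-averaging fusion are all assumptions, and the encoder is a stand-in for CLIP's image tower operating on pre-computed, L2-normalized class text embeddings.

```python
import numpy as np

def random_crops(image, n_crops=8, scale=(0.3, 0.7), rng=None):
    """Sample random rectangular crops; cropping restricts the model's
    effective receptive field to local regions (hypothetical helper)."""
    rng = rng or np.random.default_rng(0)
    H, W = image.shape[:2]
    crops = []
    for _ in range(n_crops):
        s = rng.uniform(*scale)                      # crop side fraction
        h, w = max(1, int(H * s)), max(1, int(W * s))
        y = rng.integers(0, H - h + 1)               # top-left corner
        x = rng.integers(0, W - w + 1)
        crops.append(image[y:y + h, x:x + w])
    return crops

def mce_logits(image, class_text_embs, encode_image, n_crops=8):
    """Score classes by averaging cosine similarity over the full image
    plus random local crops (one plausible fusion rule)."""
    views = [image] + random_crops(image, n_crops)
    embs = np.stack([encode_image(v) for v in views])    # (n_views, d)
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)  # unit sphere
    sims = embs @ class_text_embs.T                      # (n_views, n_classes)
    return sims.mean(axis=0)                             # fuse all views

# Toy stand-in for a CLIP-style image encoder (NOT the real model):
def toy_encoder(img):
    return np.array([img.mean(), img.std(), img.max()])

rng = np.random.default_rng(1)
image = rng.random((32, 32))
text_embs = rng.random((5, 3))
text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)
logits = mce_logits(image, text_embs, toy_encoder)  # one score per class
```

Because no weights change, this slots in front of any frozen CLIP checkpoint; the only inference-time cost is the extra forward passes for the crops.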
📝 Abstract
Vision-Language Models (VLMs) like CLIP achieve cross-modal semantic alignment through contrastive learning, exhibiting robust zero-shot generalization. Traditional prompt engineering, however, relies predominantly on coarse-grained category labels, neglecting fine-grained local semantics. Existing approaches assume that VLMs inherently recognize localized visual details and attempt to enhance classification by augmenting text prompts with attribute descriptors generated by large language models. However, our systematic experiments reveal a critical limitation: CLIP's strong bias toward global image patterns hinders its ability to process localized visual descriptors. To address this fundamental constraint, we propose a simple, effective, and plug-and-play solution that enables CLIP to "See Both the Forest and the Trees." Specifically, we employ stochastic multi-crop augmentation to activate CLIP's latent capacity for localized feature analysis. By cropping only partial regions, the approach effectively constrains the model's receptive field and recalibrates its attention, thereby mitigating the inherent global bias. We evaluate the proposed method under zero-shot, few-shot, and test-time adaptation settings, and extensive experiments demonstrate that MCE achieves promising performance.