Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach

📅 2025-11-17
🤖 AI Summary
Vision-language models (VLMs) like CLIP suffer from background-based spurious correlations in zero-shot recognition, yet existing evaluation protocols cannot disentangle performance degradation caused by background bias from confounding factors such as viewpoint/scale variation or fine-grained confusion. Method: We propose Cluster-based Concept Importance (CCI), an interpretable attribution method that leverages CLIP’s patch embeddings for semantic clustering, integrates masked-response analysis, and incorporates GroundedSAM-based foreground-background segmentation to quantify concept importance. Contribution/Results: Building upon CCI, we introduce COVAR—the first controlled-variable benchmark that systematically isolates multiple interference sources. Experiments show CCI improves deletion-AUC on MS COCO by over 2×, setting a new state-of-the-art in attribution faithfulness. COVAR’s evaluation of 18 CLIP variants exposes critical limitations of prevailing debiasing benchmarks, providing both diagnostic tools and rigorous evaluation standards for developing robust VLMs.
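The deletion-AUC metric cited above follows the standard deletion protocol for attribution faithfulness: delete the patches an attribution ranks most important, in order, and integrate the model's score over the fraction deleted (lower is better, since a faithful attribution makes the score collapse early). A minimal generic sketch, not the paper's exact implementation:

```python
import numpy as np

def deletion_auc(scores):
    """Area under the model-score curve as top-ranked patches are
    progressively deleted (x-axis: fraction deleted, uniform steps).
    Lower is better. Generic sketch of the deletion protocol, not
    the paper's exact implementation."""
    s = np.asarray(scores, dtype=float)
    dx = 1.0 / (len(s) - 1)
    # trapezoidal rule over the uniformly spaced score curve
    return float(((s[:-1] + s[1:]) / 2.0).sum() * dx)

# toy curves: model score after deleting 0%, 25%, ..., 100% of patches
faithful = [1.00, 0.30, 0.15, 0.10, 0.05]    # score collapses fast
unfaithful = [1.00, 0.95, 0.90, 0.60, 0.05]  # important patches found late
```

For these toy curves, `deletion_auc(faithful)` is well below `deletion_auc(unfaithful)`, which is the sense in which a "2x improvement" on deletion-AUC indicates more faithful attributions.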

📝 Abstract
Contrastive vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition yet remain vulnerable to spurious correlations, particularly background over-reliance. We introduce Cluster-based Concept Importance (CCI), a novel interpretability method that uses CLIP's own patch embeddings to group spatial patches into semantically coherent clusters, mask them, and evaluate relative changes in model predictions. CCI sets a new state of the art on faithfulness benchmarks, surpassing prior methods by large margins; for example, it yields more than a twofold improvement on the deletion-AUC metric for MS COCO retrieval. We further show that CCI, when combined with GroundedSAM, can automatically categorize predictions as foreground- or background-driven, providing a crucial diagnostic capability. Existing benchmarks such as CounterAnimals, however, rely solely on accuracy and implicitly attribute all performance degradation to background correlations. Our analysis shows this assumption to be incomplete, since many errors arise from viewpoint variation, scale shifts, and fine-grained object confusions. To disentangle these effects, we introduce COVAR, a benchmark that systematically varies object foregrounds and backgrounds. Leveraging CCI with COVAR, we present a comprehensive evaluation of eighteen CLIP variants, offering methodological advances and empirical evidence that chart a path toward more robust VLMs.
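The CCI pipeline the abstract describes (cluster patch embeddings, mask each cluster, score the relative prediction change) can be sketched on synthetic data. This is a toy stand-in: the tiny k-means and the mean-pooled cosine score substitute for CLIP's actual clustering recipe and image-text scoring, which the paper defines in full.

```python
import numpy as np

def kmeans(x, k, iters=20):
    """Tiny Lloyd's k-means with farthest-point initialization.
    Stand-in for the paper's semantic clustering of CLIP patch
    embeddings; the exact recipe and choice of k may differ."""
    centers = [x[0]]
    for _ in range(k - 1):
        d = np.min([((x - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(x[int(d.argmax())])
    centers = np.stack(centers)
    for _ in range(iters):
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(0)
    return labels

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def concept_importance(patch_emb, text_emb, k=3):
    """CCI-style attribution sketch: cluster patches into concepts,
    zero-mask each cluster, and measure the relative drop in a toy
    image-text score (mean-pooled cosine, standing in for CLIP)."""
    labels = kmeans(patch_emb, k)
    base = cosine(patch_emb.mean(0), text_emb)
    importance = np.empty(k)
    for j in range(k):
        masked = patch_emb.copy()
        masked[labels == j] = 0.0  # remove one concept region
        drop = base - cosine(masked.mean(0), text_emb)
        importance[j] = drop / max(abs(base), 1e-8)
    return labels, importance
```

On synthetic patch embeddings where one tight cluster aligns with the text embedding and the others do not, masking the aligned cluster produces by far the largest score drop, so that cluster receives the highest importance.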
Problem

Research questions and friction points this paper is trying to address.

How vulnerable is CLIP's zero-shot recognition to spurious background correlations?
How can spatial attributions over CLIP's patch embeddings be made interpretable and faithful?
How can background effects be disentangled from viewpoint, scale, and fine-grained confounds that existing accuracy-only benchmarks conflate?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cluster-based Concept Importance (CCI): semantic clustering of CLIP patch embeddings with masked-response analysis for faithful attribution
CCI combined with GroundedSAM to automatically label predictions as foreground- or background-driven
COVAR: a controlled-variable benchmark that systematically varies object foregrounds and backgrounds
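The CCI + GroundedSAM diagnostic in the second bullet reduces, conceptually, to comparing attribution mass inside and outside a foreground mask. A minimal sketch, assuming a binary foreground mask is already available (e.g. from a segmenter such as GroundedSAM); the 0.5 threshold is an illustrative choice, not taken from the paper:

```python
import numpy as np

def prediction_driver(importance_map, fg_mask, threshold=0.5):
    """Label a prediction foreground- or background-driven by the
    share of non-negative attribution mass inside `fg_mask`.
    The 0.5 threshold is an illustrative assumption."""
    imp = np.clip(np.asarray(importance_map, dtype=float), 0.0, None)
    total = imp.sum()
    if total == 0.0:
        return "undetermined"
    fg_share = float(imp[np.asarray(fg_mask, dtype=bool)].sum() / total)
    return "foreground-driven" if fg_share >= threshold else "background-driven"
```

A prediction whose attribution mass sits mostly on background pixels would be flagged as background-driven even when the top-1 label is correct, which is exactly the failure mode accuracy-only benchmarks cannot see.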