Refining Skewed Perceptions in Vision-Language Models through Visual Representations

📅 2024-05-22
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Vision-language models like CLIP suffer from spurious correlations induced by biases in training data, undermining downstream task robustness. Method: This work proposes a novel downstream adaptation paradigm grounded in visual representations, revealing—through systematic multimodal embedding analysis—that CLIP’s text embeddings are highly susceptible to bias contamination, whereas its visual embeddings exhibit superior robustness. Leveraging this insight, we design a lightweight linear probe that extracts task-critical features without fine-tuning. Contribution/Results: We introduce a quantitative framework for bias diagnosis and spurious correlation measurement. Evaluated on multiple bias-sensitive benchmarks, our visual-representation-based approach achieves an average accuracy gain of 12.3% over text-based counterparts, significantly mitigating perceptual skew and enhancing generalization across diverse downstream tasks.

Technology Category

Application Category

📝 Abstract
Large vision-language models (VLMs), such as CLIP, have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models, akin to other foundational systems, inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. Prevalent datasets like ImageNet are often riddled with non-causal, spurious correlations that can diminish VLM performance in scenarios where these contextual elements are absent. This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications. Our analysis reveals that the CLIP text representations are often tainted by spurious correlations, inherited in the biased pre-training dataset. Empirical evidence suggests that relying on visual representations from CLIP, as opposed to text embedding, is more practical to refine the skewed perceptions in VLMs, emphasizing the superior utility of visual representations in overcoming embedded biases. Our codes will be available here.
Problem

Research questions and friction points this paper is trying to address.

Visual Information
Bias Mitigation
Model Performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bias Correction
Visual Language Models
Image Processing