Debiasing CLIP: Interpreting and Correcting Bias in Attention Heads

📅 2025-05-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal models like CLIP exhibit strong zero-shot capabilities but inherit biases from spurious correlations (e.g., between background cues or gender attributes and target classes) that are encoded in specific attention heads. To address this, the authors propose Locate-Then-Correct (LTC), a contrastive debiasing framework. LTC first uses mechanistic interpretability analysis of attention heads to identify those correlated with bias. It then applies training-free, fine-tuning-free interventions: targeted ablation of spurious heads and amplification of task-relevant heads, combined with orthogonal projection to integrate discriminative features. To the authors' knowledge, LTC is the first post-hoc, training-free zero-shot debiasing method of this kind. On benchmarks with intrinsic background and gender biases, it improves worst-group accuracy by over 50%, substantially outperforming existing non-training-based approaches. Visualization and ablation studies corroborate both the identification of bias heads and the efficacy of the correction mechanism.

📝 Abstract
Multimodal models like CLIP have gained significant attention due to their remarkable zero-shot performance across various tasks. However, studies have revealed that CLIP can inadvertently learn spurious associations between target variables and confounding factors. To address this, we introduce Locate-Then-Correct (LTC), a contrastive framework that identifies spurious attention heads in Vision Transformers via mechanistic insights and mitigates them through targeted ablation. Furthermore, LTC identifies salient, task-relevant attention heads, enabling the integration of discriminative features through orthogonal projection to improve classification performance. We evaluate LTC on benchmarks with inherent background and gender biases, achieving over a 50% gain in worst-group accuracy compared to non-training post-hoc baselines. Additionally, we visualize the representation of selected heads and find that the presented interpretation corroborates our contrastive mechanism for identifying both spurious and salient attention heads. Code available at https://github.com/wj210/CLIP_LTC.
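As a rough illustration of the orthogonal-projection idea from the abstract, the sketch below removes the component of each embedding along a single bias direction. This is not the authors' code: in LTC the spurious direction would come from an identified bias head's output, whereas here it is a hand-picked toy vector.

```python
import numpy as np

def project_out(embeddings, spurious_dir):
    """Orthogonally project embeddings onto the complement of a spurious
    direction, removing the biased component from each row."""
    d = spurious_dir / np.linalg.norm(spurious_dir)
    # Subtract each embedding's projection onto the spurious direction.
    return embeddings - np.outer(embeddings @ d, d)

# Toy example: two 4-d "CLIP embeddings", with bias along the first axis.
emb = np.array([[1.0, 2.0, 0.0, 0.0],
                [3.0, 0.0, 1.0, 0.0]])
bias = np.array([1.0, 0.0, 0.0, 0.0])
debiased = project_out(emb, bias)
# Every debiased embedding is now orthogonal to the bias direction.
assert np.allclose(debiased @ bias, 0.0)
```

In practice the projection would be applied to normalized CLIP image features before computing cosine similarity with the text-class embeddings.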
Problem

Research questions and friction points this paper is trying to address.

CLIP inadvertently learns spurious associations between target classes and confounders such as background or gender
These spurious correlations degrade worst-group accuracy in zero-shot classification
It is unclear which Vision Transformer attention heads encode spurious versus task-relevant features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies spurious attention heads via mechanistic insights
Mitigates bias through targeted ablation technique
Improves classification via orthogonal projection integration
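The ablation-plus-amplification idea in the bullets above can be sketched as follows, assuming CLIP's image embedding has already been decomposed into per-head contributions (as in mechanistic-interpretability analyses of ViTs). The function name, head indices, and scale factor are illustrative, not the paper's implementation:

```python
import numpy as np

def correct_heads(head_contribs, spurious_heads, salient_heads=None, scale=2.0):
    """head_contribs: (num_heads, dim) array whose rows sum to the image
    embedding. Zero out spurious heads (targeted ablation) and optionally
    scale salient, task-relevant heads, then recombine."""
    out = head_contribs.copy()
    out[spurious_heads] = 0.0          # ablate identified bias heads
    if salient_heads is not None:
        out[salient_heads] *= scale    # amplify task-relevant heads
    return out.sum(axis=0)             # recombined corrected embedding

# Toy decomposition: 4 heads, each contributing an all-ones 3-d vector.
contribs = np.ones((4, 3))
emb = correct_heads(contribs, spurious_heads=[0], salient_heads=[1])
# head 0 zeroed, head 1 doubled, heads 2-3 unchanged: 0 + 2 + 1 + 1 = 4.
assert np.allclose(emb, 4.0)
```

In a real pipeline the per-head contributions would be extracted with forward hooks on the last attention blocks of the ViT image encoder.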