🤖 AI Summary
This work proposes a mechanistic fairness auditing framework that pinpoints where demographic bias arises inside vision models, addressing the limitation that existing audits struggle to localize the internal origins of bias. Focusing on the CLIP vision encoder, the approach identifies gender and age biases at the level of individual attention heads through projected residual-stream decomposition, zero-shot concept activation vectors (CAVs), and bias-amplifying textual prompt analysis. This enables, for the first time, fine-grained attribution of population-level biases to specific attention mechanisms in a Vision Transformer. Experiments reveal four critical attention heads whose ablation significantly reduces gender bias, lowering Cramer's V from 0.381 to 0.362, while slightly improving accuracy. In contrast, age bias is encoded more diffusely and yields weaker ablation effects, highlighting differences in how distinct demographic attributes are represented within the model.
📝 Abstract
Standard fairness audits of foundation models quantify that a model is biased, but not where inside the network the bias resides. We propose a mechanistic fairness audit that combines projected residual-stream decomposition, zero-shot Concept Activation Vectors, and bias-augmented TextSpan analysis to locate demographic bias at the level of individual attention heads in vision transformers. As a feasibility case study, we apply this pipeline to the CLIP ViT-L-14 encoder on 42 profession classes of the FACET benchmark, auditing both gender and age bias. For gender, the pipeline identifies four terminal-layer heads whose ablation reduces global bias (Cramer's V: 0.381 → 0.362) while marginally improving accuracy (+0.42%); a layer-matched random control confirms that this effect is specific to the identified heads. A single head in the final layer accounts for most of the reduction in the most stereotyped classes, and class-level analysis shows that corrected predictions shift toward the correct occupation. For age, the same pipeline identifies candidate heads, but ablation produces weaker and less consistent effects, suggesting that age bias is encoded more diffusely than gender bias in this model. These results provide preliminary evidence that head-level bias localisation is feasible for discriminative vision encoders and that the degree of localisability may vary across protected attributes.
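The head-ablation step described above can be illustrated in toy form. The sketch below assumes a residual-stream decomposition in which each (layer, head) pair contributes an additive vector to the image representation, and implements mean-ablation (replacing a head's contribution with its dataset mean); the tensor shapes, the `ablate` helper, and the mean-ablation choice are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_layers, n_heads, d = 16, 2, 4, 8

# Hypothetical per-head direct contributions to the CLS residual stream,
# shape (images, layers, heads, d). In the paper these would come from a
# projected residual-stream decomposition of the CLIP vision encoder.
contrib = rng.normal(size=(n_images, n_layers, n_heads, d))

def ablate(contrib, heads):
    """Mean-ablate the listed (layer, head) pairs: replace each head's
    per-image contribution with its mean over the dataset, so the head
    carries no image-specific information after ablation."""
    out = contrib.copy()
    for layer, head in heads:
        out[:, layer, head] = contrib[:, layer, head].mean(axis=0)
    return out

# Ablate two candidate heads in the final layer, then recompose the
# image representations by summing over layers and heads.
ablated = ablate(contrib, [(1, 0), (1, 3)])
reps = ablated.sum(axis=(1, 2))
```

After ablation, the targeted heads are constant across images while all other head contributions are untouched, which is the property the paper's layer-matched random control compares against.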
Keywords: Bias · CLIP · Mechanistic Interpretability · Vision Transformer · Fairness
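The headline bias metric, Cramer's V, measures association between two categorical variables (e.g. predicted demographic attribute vs. profession class) from a contingency table of counts. A minimal self-contained sketch (the exact tabulation the paper uses is an assumption here):

```python
import numpy as np

def cramers_v(table):
    """Cramer's V for an r x c contingency table of counts:
    V = sqrt(chi2 / (n * (min(r, c) - 1))), with chi2 the Pearson
    chi-squared statistic against the independence expectation."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    row = table.sum(axis=1, keepdims=True)   # (r, 1) row marginals
    col = table.sum(axis=0, keepdims=True)   # (1, c) column marginals
    expected = row @ col / n                 # counts expected under independence
    chi2 = ((table - expected) ** 2 / expected).sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, c) - 1))))

# Perfect association between attribute and class -> V = 1.0
print(cramers_v([[10, 0], [0, 10]]))   # -> 1.0
# Independence -> V = 0.0
print(cramers_v([[5, 5], [5, 5]]))     # -> 0.0
```

V ranges from 0 (independence) to 1 (perfect association), so the reported drop from 0.381 to 0.362 is a modest but global reduction in the attribute-class association after ablation.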