🤖 AI Summary
This work investigates whether demographic bias mechanisms can be decoupled from foundational identification capabilities in language models: specifically, whether racial and gender stereotypes can be mitigated without degrading accuracy in recognizing names, occupations, or education levels. We propose a multi-task evaluation framework together with a sparse autoencoder-based interpretability analysis, revealing for the first time that bias stems from task-specific internal mechanisms rather than from global demographic token representations. Building on this insight, we design dual-path intervention strategies: attribution-driven and correlation-driven feature localization and editing. Applying surgical inference-time interventions to Gemma-2-9B yields three key results: (1) occupational stereotyping is significantly reduced while name recognition accuracy remains intact; (2) educational bias is mitigated more effectively via correlation-driven methods; and (3) attribution-based ablation induces prior collapse and paradoxical bias amplification, empirically validating both the necessity and the efficacy of mechanistic decoupling.
📝 Abstract
We investigate the extent to which demographic bias mechanisms are independent of general demographic recognition in language models. Using a multi-task evaluation setup in which demographics are associated with names, professions, and education levels, we measure whether models can be debiased while preserving demographic detection capabilities. We compare attribution-based and correlation-based methods for locating bias features. We find that targeted sparse autoencoder feature ablations in Gemma-2-9B reduce bias without degrading recognition performance: attribution-based ablations mitigate race and gender profession stereotypes while preserving name recognition accuracy, whereas correlation-based ablations are more effective for education bias. Qualitative analysis further reveals that removing attribution-derived features in education tasks induces "prior collapse", which increases overall bias. This highlights the need for dimension-specific interventions. Overall, our results show that demographic bias arises from task-specific mechanisms rather than absolute demographic markers, and that mechanistic inference-time interventions can enable surgical debiasing without compromising core model capabilities.
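The core intervention described above, ablating selected sparse autoencoder (SAE) features from a model's activations at inference time, can be illustrated with a minimal sketch. This is not the paper's implementation: the SAE weights below are random stand-ins for a trained SAE, the toy dimensions are far smaller than those used with Gemma-2-9B, and `sae_ablate` is a hypothetical helper name. It shows only the basic mechanic of subtracting the decoded contribution of targeted (e.g. bias-associated) features from an activation vector, leaving the rest of the representation untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; real SAEs trained on Gemma-2-9B use much larger dictionaries.
d_model, d_sae = 8, 32

# Random stand-ins for a trained SAE's encoder/decoder weights.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)


def sae_ablate(h, ablate_idx):
    """Remove the contribution of selected SAE features from activation h.

    Encodes h into sparse (ReLU) feature activations, then subtracts the
    decoded contribution of the targeted features from the original
    activation. Features outside ablate_idx, and the SAE's reconstruction
    error, are left untouched.
    """
    f = np.maximum(h @ W_enc + b_enc, 0.0)            # sparse feature activations
    contribution = f[ablate_idx] @ W_dec[ablate_idx]  # decoded contribution of ablated features
    return h - contribution


h = rng.normal(size=d_model)            # a residual-stream activation (toy)
h_edit = sae_ablate(h, ablate_idx=[3, 17])
```

In an actual pipeline this subtraction would be applied inside a forward hook at the layer where the SAE was trained, with `ablate_idx` chosen by the attribution-driven or correlation-driven localization methods the abstract compares.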