Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether demographic bias mechanisms can be decoupled from foundational identification capabilities in language models: specifically, whether racial and gender stereotypes can be mitigated without degrading accuracy in recognizing names, occupations, or education levels. We propose a multi-task evaluation framework and a sparse autoencoder-based interpretability analysis, revealing that bias stems from task-specific internal mechanisms rather than global demographic token representations. Building on this insight, we design two intervention strategies: attribution-driven and correlation-driven feature localization and editing. Applying surgical inference-time interventions to Gemma-2-9B yields three key results: (1) occupational stereotyping is significantly reduced while name recognition accuracy remains intact; (2) educational bias is mitigated more effectively by correlation-driven methods; and (3) attribution-based ablation can induce prior collapse and paradoxically amplify bias. Together, these findings validate both the necessity and the efficacy of mechanistic decoupling.

📝 Abstract
We investigate how independent demographic bias mechanisms are from general demographic recognition in language models. Using a multi-task evaluation setup where demographics are associated with names, professions, and education levels, we measure whether models can be debiased while preserving demographic detection capabilities. We compare attribution-based and correlation-based methods for locating bias features. We find that targeted sparse autoencoder feature ablations in Gemma-2-9B reduce bias without degrading recognition performance: attribution-based ablations mitigate race and gender profession stereotypes while preserving name recognition accuracy, whereas correlation-based ablations are more effective for education bias. Qualitative analysis further reveals that removing attribution features in education tasks induces "prior collapse", thus increasing overall bias. This highlights the need for dimension-specific interventions. Overall, our results show that demographic bias arises from task-specific mechanisms rather than absolute demographic markers, and that mechanistic inference-time interventions can enable surgical debiasing without compromising core model capabilities.
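The core intervention described above, ablating selected sparse autoencoder features at inference time, can be illustrated with a minimal numpy sketch. All names, shapes, and weights here are hypothetical toys, not the paper's actual SAE or feature indices; the point is only the mechanics of encode, zero the chosen latents, decode.

```python
import numpy as np

# Toy sketch of an inference-time SAE feature ablation.
# d_model, d_sae, W_enc, W_dec, and the feature indices are all
# illustrative assumptions, not values from the paper.
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32                     # residual width, SAE dictionary size

W_enc = rng.normal(size=(d_model, d_sae))  # SAE encoder weights
W_dec = rng.normal(size=(d_sae, d_model))  # SAE decoder weights
b_enc = np.zeros(d_sae)                    # encoder bias

def ablate(resid, feature_ids):
    """Encode a residual-stream vector, zero the chosen SAE latents, decode."""
    z = np.maximum(resid @ W_enc + b_enc, 0.0)  # ReLU latent activations
    z[list(feature_ids)] = 0.0                  # surgical ablation
    return z @ W_dec                            # edited residual vector

resid = rng.normal(size=d_model)
edited = ablate(resid, feature_ids={3, 17})
print(edited.shape)  # (8,)
```

In a real model this edit would be applied inside a forward hook on the layer the SAE was trained on, so only the targeted features are removed while the rest of the computation is untouched.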
Problem

Research questions and friction points this paper is trying to address.

Investigates whether language models can remove demographic bias while preserving demographic recognition capabilities
Compares attribution-based and correlation-based methods for locating bias features in models
Examines if bias arises from task-specific mechanisms rather than absolute demographic markers
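One of the two localization methods the paper compares, correlation-based feature localization, can be sketched as ranking SAE features by the absolute Pearson correlation between their activations and a demographic label across a probe set. The data, the planted feature, and the function name below are hypothetical illustrations, not the paper's setup.

```python
import numpy as np

# Hypothetical sketch of correlation-based feature localization:
# rank SAE features by |Pearson r| with a binary demographic label.
rng = np.random.default_rng(1)
n_examples, d_sae = 200, 16

acts = rng.normal(size=(n_examples, d_sae))           # feature activations
label = (rng.random(n_examples) > 0.5).astype(float)  # binary label
acts[:, 5] += 2.0 * label                             # plant a label-tracking feature

def top_correlated(acts, label, k):
    """Return indices of the k features most correlated with the label."""
    a = acts - acts.mean(axis=0)
    l = label - label.mean()
    r = (a * l[:, None]).sum(axis=0) / (
        np.linalg.norm(a, axis=0) * np.linalg.norm(l) + 1e-8
    )
    return np.argsort(-np.abs(r))[:k]

candidates = top_correlated(acts, label, k=3)
print(candidates)  # the planted feature 5 should rank first
```

Features selected this way track the label wherever it appears, which is one plausible reason such ablations behave differently from attribution-selected ones on some task dimensions.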
Innovation

Methods, ideas, or system contributions that make the work stand out.

Targeted sparse autoencoder feature ablations in Gemma-2-9B reduce bias without degrading demographic recognition
Attribution-based ablations mitigate race and gender profession stereotypes while preserving name recognition accuracy
Correlation-based ablations are more effective for mitigating education bias
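The other localization route, attribution-based selection, can be sketched as scoring each feature by activation times the gradient of a bias metric with respect to that feature. The toy below uses a linear bias readout, where that gradient is simply the readout weight; the weights, activations, and planted feature are illustrative assumptions only.

```python
import numpy as np

# Hypothetical sketch of attribution-based feature localization:
# score = activation x gradient of a bias metric. For a linear readout
# bias_logit = z @ w, the gradient w.r.t. z is just w.
rng = np.random.default_rng(2)
d_sae = 16

z = np.abs(rng.normal(size=d_sae))  # feature activations on a biased prompt
w = rng.normal(size=d_sae)          # gradient of the bias logit w.r.t. z
z[9], w[9] = 4.0, 4.0               # plant a high-attribution feature

attribution = z * w                 # activation-times-gradient scores
ranked = np.argsort(-np.abs(attribution))
print(ranked[:3])  # the planted feature 9 should rank first
```

Because attribution scores features by their causal contribution to one metric, ablating them can also remove the model's prior for that task, which is consistent with the "prior collapse" failure mode reported for education bias.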