🤖 AI Summary
This paper addresses the challenge of detecting and mitigating gender stereotypes and biases in large language models (LLMs), where such biases are deeply embedded and difficult to identify or intervene on. Methodologically, it introduces an unsupervised framework based on the model's internal representations: first, it extracts interpretable and reusable gender-concept representation vectors from unlabeled data via a novel probability-weighting mechanism; second, it designs a projection-based latent-space steering technique for precise, targeted control over the generation process. Contributions include: (1) an annotation-free paradigm for gender representation extraction; (2) a representation-engineering framework that jointly provides interpretability and operational controllability; and (3) significant reductions in gender bias across multiple fairness benchmarks—e.g., occupational-gender association bias—while preserving core language-modeling performance.
📝 Abstract
Large language models (LLMs) are known to perpetuate stereotypes and exhibit biases. Various strategies have been proposed to mitigate potential harms that may result from these biases, but most work studies biases in LLMs as a black-box problem without considering how concepts are represented within the model. We adapt techniques from representation engineering to study how the concept of "gender" is represented within LLMs. We introduce a new method that extracts concept representations via probability weighting without labeled data and efficiently selects a steering vector for measuring and manipulating the model's representation. We also present a projection-based method that enables precise steering of model predictions and demonstrate its effectiveness in mitigating gender bias in LLMs.
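The projection-based steering idea described above can be sketched in a few lines. Note this is a minimal illustration, not the paper's implementation: the direction-finding step below uses a simple difference-of-means heuristic as a stand-in for the paper's probability-weighting extraction, and all function names, dimensions, and data are hypothetical.

```python
import numpy as np

def concept_direction(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """Estimate a unit 'gender' direction from two sets of hidden states
    (rows = examples). Difference-of-means is a stand-in here for the
    paper's unsupervised probability-weighting extraction."""
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

def project_out(h: np.ndarray, v: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Projection-based steering: remove alpha times the component of a
    hidden state h along the unit concept direction v."""
    return h - alpha * (h @ v) * v

# Toy hidden states standing in for activations from gendered contexts.
rng = np.random.default_rng(0)
pos = rng.normal(0.5, 1.0, size=(32, 8))
neg = rng.normal(-0.5, 1.0, size=(32, 8))

v = concept_direction(pos, neg)
h = rng.normal(size=8)
h_steered = project_out(h, v)
print(abs(float(h_steered @ v)) < 1e-9)  # steered state is orthogonal to v
```

With `alpha=1.0` the concept component is fully removed; fractional values of `alpha` would attenuate rather than erase it, which is the kind of fine-grained control latent-space steering aims for.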