🤖 AI Summary
This paper addresses the challenge of detecting and mitigating gender stereotypes and biases in large language models (LLMs), where such biases are deeply embedded and difficult to identify or intervene on. Methodologically, it introduces an unsupervised framework based on the model's internal representations: first, it extracts interpretable and reusable gender-concept representation vectors from unlabeled data via a novel probability-weighting mechanism; second, it designs a projection-based latent-space steering technique for precise, targeted control over the generation process. Contributions include: (1) an annotation-free paradigm for gender representation extraction; (2) a representation-engineering framework that jointly provides interpretability and operational controllability; and (3) significant reductions in gender bias across multiple fairness benchmarks—e.g., occupational-gender association bias—while preserving core language-modeling performance.
📝 Abstract
Large language models (LLMs) are known to perpetuate stereotypes and exhibit biases. Various strategies have been proposed to mitigate potential harms that may result from these biases, but most work studies biases in LLMs as a black-box problem without considering how concepts are represented within the model. We adapt techniques from representation engineering to study how the concept of "gender" is represented within LLMs. We introduce a new method that extracts concept representations via probability weighting without labeled data and efficiently selects a steering vector for measuring and manipulating the model's representation. We also present a projection-based method that enables precise steering of model predictions and demonstrate its effectiveness in mitigating gender bias in LLMs.
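The projection-based steering idea described above can be sketched in a few lines. Note this is a minimal illustration, not the paper's implementation: the direction-finding step below uses a simple difference-of-means heuristic as a stand-in for the paper's probability-weighting extraction, and all function names, dimensions, and data are hypothetical.

```python
import numpy as np

def concept_direction(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """Estimate a unit 'gender' direction from two sets of hidden states
    (rows = examples). Difference-of-means is a stand-in here for the
    paper's unsupervised probability-weighting extraction."""
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

def project_out(h: np.ndarray, v: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Projection-based steering: remove alpha times the component of a
    hidden state h along the unit concept direction v."""
    return h - alpha * (h @ v) * v

# Toy hidden states standing in for activations from gendered contexts.
rng = np.random.default_rng(0)
pos = rng.normal(0.5, 1.0, size=(32, 8))
neg = rng.normal(-0.5, 1.0, size=(32, 8))

v = concept_direction(pos, neg)
h = rng.normal(size=8)
h_steered = project_out(h, v)
print(abs(float(h_steered @ v)) < 1e-9)  # steered state is orthogonal to v
```

With `alpha=1.0` the concept component is fully removed; fractional values of `alpha` would attenuate rather than erase it, which is the kind of fine-grained control latent-space steering aims for.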