Self-Blinding and Counterfactual Self-Simulation Mitigate Biases and Sycophancy in Large Language Models

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models are prone to unfair judgments due to irrelevant biases—such as those based on gender or race—and a tendency toward flattery. This work proposes a self-blinding mechanism that, for the first time, enables a model to simulate its own counterfactual “blinded” counterpart via API calls, thereby distinguishing between implicit and intentional biases. By integrating counterfactual prompting with a bias detection framework, the approach significantly mitigates gender and racial biases as well as flattery behaviors without relying on external intervention. The method enhances both fairness and transparency in model-based decision-making and demonstrates greater stability and interpretability compared to conventional prompting strategies.

📝 Abstract
Fair decisions require ignoring irrelevant, potentially biasing, information. To achieve this, decision-makers need to approximate what decision they would have made had they not known certain facts, such as the gender or race of a job candidate. This counterfactual self-simulation is notoriously hard for humans, leading to biased judgments even by well-meaning actors. Here we show that large language models (LLMs) suffer from similar limitations in their ability to approximate what decisions they would make under counterfactual knowledge in offsetting gender and race biases and overcoming sycophancy. We show that prompting models to ignore or pretend not to know biasing information fails to offset these biases and occasionally backfires. However, unlike humans, LLMs can be given access to a ground-truth model of their own counterfactual cognition -- their own API. We show that this access to the responses of a blinded replica enables fairer decisions, while providing greater transparency to distinguish implicit from intentionally biased behavior.
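The core mechanism in the abstract — letting a model consult a "blinded replica" of itself by re-querying its own API on a redacted prompt — can be sketched in a few lines. The sketch below is a minimal illustration, not the paper's implementation: `call_model` is a hypothetical stand-in for an API call (here a toy scorer with a deliberate name-based bias), and the sensitive-term list and scoring values are invented for demonstration.

```python
def call_model(prompt: str) -> float:
    """Hypothetical stand-in for an LLM API call. The toy scorer
    deliberately shifts its rating when a gendered name appears,
    mimicking an implicit bias."""
    score = 0.7
    if "Maria" in prompt:
        score -= 0.1  # biased penalty the blinded call will not show
    return score

def blind(prompt: str, sensitive_terms: list[str]) -> str:
    """Redact biasing attributes before the counterfactual self-call."""
    for term in sensitive_terms:
        prompt = prompt.replace(term, "[REDACTED]")
    return prompt

def self_blinded_decision(prompt: str, sensitive_terms: list[str]):
    """Compare the model's answer with its blinded replica's answer.

    A gap between the two scores flags a potential bias; acting on the
    blinded score is the fairer decision."""
    original = call_model(prompt)
    counterfactual = call_model(blind(prompt, sensitive_terms))
    return counterfactual, counterfactual - original

fair_score, bias_gap = self_blinded_decision(
    "Rate candidate Maria for the engineering role.", ["Maria"]
)
```

A nonzero `bias_gap` indicates the unblinded call was influenced by the redacted attribute, which is what lets the approach separate implicit bias (a gap the model did not report) from intentional behavior.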
Problem

Research questions and friction points this paper is trying to address.

bias
sycophancy
counterfactual reasoning
fairness
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-blinding
counterfactual self-simulation
bias mitigation
large language models
sycophancy