Self-Debias: Self-correcting for Debiasing Large Language Models

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the tendency of large language models to propagate biases through chain-of-thought reasoning, and the limited adaptability of existing debiasing methods once propagation has been triggered. The authors propose an endogenous self-correction mechanism that formulates debiasing as a dynamically constrained reallocation of the model's output probability mass, shifting it from biased heuristics toward unbiased reasoning paths. Through trajectory-level preference optimization, the method revises biased reasoning suffixes while preserving valid context prefixes. Combined with consistency filtering over synthetically generated self-supervision signals, the framework supports online self-improvement without continuous external supervision. With only 20,000 annotated samples, the approach achieves substantial gains in debiasing efficacy while maintaining strong general reasoning capability.
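To make the trajectory-level objective concrete, here is a minimal sketch of a suffix-restricted, DPO-style preference loss. It illustrates the general idea rather than the paper's actual implementation: the helpers `shared_prefix_len` and `suffix_preference_loss`, the `beta` temperature, and the random log-probabilities in the demo are all assumptions for exposition.

```python
import torch
import torch.nn.functional as F

def shared_prefix_len(chosen_ids, rejected_ids):
    """Length of the common token prefix shared by two trajectories."""
    n = 0
    for a, b in zip(chosen_ids, rejected_ids):
        if a != b:
            break
        n += 1
    return n

def suffix_preference_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
                           prefix_len, beta=0.1):
    """DPO-style loss computed only over tokens after the shared prefix,
    so the valid contextual prefix receives neither penalty nor reward."""
    chosen = (pi_chosen[prefix_len:] - ref_chosen[prefix_len:]).sum()
    rejected = (pi_rejected[prefix_len:] - ref_rejected[prefix_len:]).sum()
    return -F.logsigmoid(beta * (chosen - rejected))

# Toy demo: two 12-token trajectories diverging after an 8-token prefix,
# with random stand-ins for the per-token log-probabilities that would
# normally be gathered from the policy and reference models.
torch.manual_seed(0)
logps = lambda: torch.log_softmax(torch.randn(12), dim=0)
prefix = shared_prefix_len(list(range(8)) + [20, 21, 22, 23],
                           list(range(8)) + [30, 31, 32, 33])
print(prefix, suffix_preference_loss(logps(), logps(), logps(), logps(), prefix).item())
```

Restricting the loss to the divergent suffix is what lets the model keep reusing a sound prefix instead of applying a broad penalty across the whole chain.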
📝 Abstract
Although Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, inherent social biases often cascade throughout the Chain-of-Thought (CoT) process, leading to continuous "Bias Propagation". Existing debiasing methods primarily focus on static constraints or external interventions, failing to identify and interrupt this propagation once triggered. To address this limitation, we introduce Self-Debias, a progressive framework designed to instill intrinsic self-correction capabilities. Specifically, we reformulate the debiasing process as a strategic resource redistribution problem, treating the model's output probability mass as a limited resource to be reallocated from biased heuristics to unbiased reasoning paths. Unlike standard preference optimization which applies broad penalties, Self-Debias employs a fine-grained trajectory-level objective subject to dynamic debiasing constraints. This enables the model to selectively revise biased reasoning suffixes while preserving valid contextual prefixes. Furthermore, we integrate an online self-improvement mechanism utilizing consistency filtering to autonomously synthesize supervision signals. With merely 20k annotated samples, Self-Debias activates efficient self-correction, achieving superior debiasing performance while preserving general reasoning capabilities without continuous external oversight.
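The online self-improvement loop described above can likewise be sketched in a few lines. Assuming three caller-supplied hooks (`sample_fn` for generation, `answer_fn` for extracting a final answer, and `bias_fn` as a bias probe; all hypothetical stand-ins, not the paper's API), consistency filtering keeps only prompts whose majority answer is stable, then pairs an unbiased trajectory against a biased one to synthesize preference supervision:

```python
import random
from collections import Counter

def synthesize_preference_pairs(sample_fn, bias_fn, answer_fn,
                                prompts, k=8, min_agreement=0.75):
    """Consistency filtering: sample k trajectories per prompt, keep only
    prompts with a stable majority answer, then pair an unbiased
    trajectory (chosen) with a biased one (rejected)."""
    pairs = []
    for prompt in prompts:
        trajectories = [sample_fn(prompt) for _ in range(k)]
        answers = Counter(answer_fn(t) for t in trajectories)
        top_answer, count = answers.most_common(1)[0]
        if count / k < min_agreement:
            continue  # inconsistent generations -> unreliable supervision, drop
        chosen = [t for t in trajectories
                  if answer_fn(t) == top_answer and not bias_fn(t)]
        rejected = [t for t in trajectories if bias_fn(t)]
        if chosen and rejected:
            pairs.append((prompt, random.choice(chosen), random.choice(rejected)))
    return pairs

# Stub demo: trajectories are strings, "[BIAS]" marks a biased chain.
random.seed(0)
demo = synthesize_preference_pairs(
    sample_fn=lambda p: random.choice([f"{p} -> A", f"{p} [BIAS] -> A"]),
    bias_fn=lambda t: "[BIAS]" in t,
    answer_fn=lambda t: t.split("->")[-1].strip(),
    prompts=["q1", "q2"])
print(demo)
```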
Problem

Research questions and friction points this paper is trying to address.

bias propagation
Large Language Models
Chain-of-Thought
debiasing
social biases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Debias
bias propagation
trajectory-level debiasing
resource redistribution
self-correction