🤖 AI Summary
In high-stakes decision-making (e.g., bail assignment, loan approval), large language models (LLMs) often amplify societal biases. To address this, we propose a generalizable Fairness Reward Model (FRM) that requires no model fine-tuning or strong supervision, only weakly supervised chain-of-thought (CoT) annotations. The FRM identifies and suppresses bias-inducing steps within reasoning paths, combining CoT reasoning with reward-weighted path aggregation to enable fairness-driven, multi-path decision optimization, and it transfers across tasks, domains, and LLM families without further training. Evaluated on real-world applications, including recidivism prediction and social media content moderation, the FRM substantially improves group fairness (e.g., reducing equal opportunity difference by 37%) while matching or exceeding baseline accuracy. Our core contribution is the first lightweight, fine-tuning-free, and broadly generalizable framework for post-hoc fairness correction in LLM-based reasoning.
📝 Abstract
Large language models are increasingly used to support high-stakes decisions, potentially influencing who is granted bail or receives a loan. Naive chain-of-thought sampling can improve average decision accuracy, but has also been shown to amplify unfair bias. To address this challenge and enable the trustworthy use of reasoning models in high-stakes decision-making, we propose a framework for training a generalizable Fairness Reward Model (FRM). Our model assigns a fairness score to LLM reasoning, enabling the system to down-weight biased trajectories and favor equitable ones when aggregating decisions across reasoning chains. We show that a single Fairness Reward Model, trained on weakly supervised, LLM-annotated examples of biased versus unbiased reasoning, transfers across tasks, domains, and model families without additional fine-tuning. Applied to real-world decision-making tasks including recidivism prediction and social media moderation, we show that our approach consistently improves fairness while matching, or even surpassing, baseline accuracy.
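The aggregation step described above, down-weighting biased trajectories when combining decisions across sampled reasoning chains, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `frm_score` is a hypothetical stand-in for the trained Fairness Reward Model (here a keyword heuristic), and `aggregate_decision` and the sample chains are invented for demonstration.

```python
from collections import defaultdict

def frm_score(reasoning: str) -> float:
    """Hypothetical stand-in for the Fairness Reward Model: returns a
    fairness score in [0, 1] for a chain-of-thought trajectory. The
    actual FRM is a trained model; this stub simply penalizes reasoning
    that invokes a protected attribute."""
    protected_terms = ("race", "gender", "age")
    text = reasoning.lower()
    return 0.1 if any(term in text for term in protected_terms) else 0.9

def aggregate_decision(chains: list[tuple[str, str]]) -> str:
    """Reward-weighted aggregation: each sampled chain votes for its
    decision with weight equal to its fairness score, so biased
    trajectories contribute less to the final decision."""
    votes: dict[str, float] = defaultdict(float)
    for reasoning, decision in chains:
        votes[decision] += frm_score(reasoning)
    return max(votes, key=votes.__getitem__)

# Illustrative loan-approval chains (invented for this sketch).
chains = [
    ("Applicant's income comfortably covers the loan payments.", "approve"),
    ("Applicants of this gender often default on loans.", "deny"),
    ("Stable employment history supports repayment.", "approve"),
]
print(aggregate_decision(chains))  # approve
```

Under this weighting, the biased "deny" chain receives weight 0.1 while each unbiased "approve" chain receives 0.9, so the equitable trajectories dominate the aggregate; with uniform (majority-vote) weights the biased chain would count as much as any other.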