🤖 AI Summary
This work investigates how evaluator rationality, measured via cognitive psychology scales and controlled behavioral experiments, affects reward signal stability in Reinforcement Learning from Human Feedback (RLHF), identifying disparities in cognitive capacity as a primary source of inconsistent, biased, and unreliable human feedback. We empirically establish a strong, statistically significant correlation between evaluator rationality and feedback quality (p < 0.01), the first such validation in the RLHF literature. Building on this finding, we propose a tripartite governance framework: pre-screening of evaluators, consistency auditing, and reliability-weighted aggregation. Experiments demonstrate that feedback from high-rationality evaluators achieves 42% higher consistency and 35% greater alignment with expert judgments. The framework significantly improves RLHF training stability and enhances model fairness and robustness, particularly under distributional shift. Together, these components provide a scalable, empirically grounded methodology for trustworthy AI alignment.
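The paper does not spell out an implementation of the consistency-auditing step, but a minimal sketch helps make the idea concrete: score each evaluator by how often they give the same preference when a comparison pair is deliberately shown twice. All function names and the data layout below are illustrative assumptions, not the authors' code.

```python
from collections import defaultdict

def consistency_score(judgments):
    """Fraction of repeated comparison pairs on which an evaluator
    gives the same preference label both times.

    judgments: list of (pair_id, preferred_option) tuples; a pair_id
    appearing more than once is a deliberately repeated audit probe.
    """
    by_pair = defaultdict(list)
    for pair_id, choice in judgments:
        by_pair[pair_id].append(choice)

    repeated = [labels for labels in by_pair.values() if len(labels) > 1]
    if not repeated:
        return None  # no repeated probes, so consistency is undefined

    agree = sum(len(set(labels)) == 1 for labels in repeated)
    return agree / len(repeated)

# Example: pair "q7" was answered the same way both times,
# pair "q3" was answered differently, so the score is 0.5.
audit = [("q7", "A"), ("q3", "B"), ("q7", "A"), ("q3", "A")]
print(consistency_score(audit))  # 0.5
```

A score like this can feed directly into the framework's other two stages: evaluators below a threshold are filtered out at pre-screening, and the remainder carry the score forward as an aggregation weight.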
📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) is central to aligning large language models (LLMs) with human values and expectations. However, the process remains susceptible to governance challenges, including evaluator bias, inconsistency, and unreliable feedback. This study examines how evaluators' cognitive capacity, specifically their level of rationality, affects the stability of reinforcement signals. A controlled experiment comparing high-rationality and low-rationality participants reveals that evaluators with higher rationality scores produce significantly more consistent and expert-aligned feedback. In contrast, lower-rationality participants demonstrate considerable variability in their reinforcement decisions (p < 0.01). To address these challenges and improve RLHF governance, we recommend evaluator pre-screening, systematic auditing of feedback consistency, and reliability-weighted reinforcement aggregation. These measures enhance the fairness, transparency, and robustness of AI alignment pipelines.
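As an illustration of the last recommendation, here is a minimal sketch of reliability-weighted reinforcement aggregation: each evaluator's preference vote is weighted by a reliability score (for example, the consistency score above), so feedback from high-rationality evaluators dominates the aggregate signal. The binary-preference setup and all names are assumptions for illustration, not the paper's implementation.

```python
def weighted_preference(votes, weights):
    """Aggregate binary preference votes into a soft label in [0, 1].

    votes:   list of 0/1 preferences (1 = response A preferred)
    weights: per-evaluator reliability scores, e.g. consistency scores
    """
    total = sum(weights)
    if total == 0:
        raise ValueError("all evaluator weights are zero")
    return sum(v * w for v, w in zip(votes, weights)) / total

# Three evaluators: two reliable ones prefer A, one noisy one prefers B.
votes = [1, 1, 0]
weights = [0.9, 0.8, 0.3]
print(weighted_preference(votes, weights))  # 0.85, a possible reward-model target
```

The design choice here is that unreliable evaluators are down-weighted rather than discarded outright, which keeps sample efficiency while still damping the variability the study attributes to low-rationality feedback.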