🤖 AI Summary
Large language models (LLMs) often produce superficially plausible yet poorly calibrated responses, undermining their reliability on complex reasoning tasks. To address this, we propose a self-supervised reinforcement-learning post-training framework that requires no external annotations. A frozen LLM generates multiple chain-of-thought (CoT) trajectories; internal confidence scores, derived from token-level log-probabilities over the final answer span, are used to rank the trajectories and construct synthetic preference data, which then drives policy refinement via preference-optimization algorithms such as DPO. The approach is the first to treat the model's intrinsic confidence as an optimizable, self-generated reward signal, eliminating dependence on human annotations, ground-truth answers, or external reward models. Experiments demonstrate substantial improvements in probabilistic calibration and consistent performance gains on arithmetic-reasoning and multiple-choice benchmarks, validating the effectiveness and high data efficiency of intrinsic-reward-driven post-training.
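The confidence-ranking step can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: the confidence definition (mean token log-probability over the final-answer span) and the pairing scheme (most confident trace against each weaker one) are assumptions chosen for clarity.

```python
# Sketch: rank CoT traces by answer-span confidence and build
# synthetic (chosen, rejected) preference pairs. Names and the exact
# confidence definition are illustrative assumptions.

def answer_confidence(token_logprobs):
    """Mean log-probability over the final-answer tokens."""
    return sum(token_logprobs) / len(token_logprobs)

def build_preference_pairs(traces):
    """traces: list of (text, answer_token_logprobs).

    Pairs the highest-confidence trace (chosen) against every
    lower-confidence trace (rejected)."""
    ranked = sorted(traces, key=lambda t: answer_confidence(t[1]),
                    reverse=True)
    chosen_text = ranked[0][0]
    return [(chosen_text, text) for text, _ in ranked[1:]]

# Toy traces: hypothetical answer spans with per-token log-probs.
traces = [
    ("... answer: 42", [-0.1, -0.2]),   # high confidence
    ("... answer: 40", [-1.5, -2.0]),   # low confidence
    ("... answer: 41", [-0.8, -0.9]),   # medium confidence
]
pairs = build_preference_pairs(traces)
```

Each resulting pair can then be fed directly to a preference-optimization trainer as a (chosen, rejected) example.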
📝 Abstract
Large Language Models (LLMs) often produce plausible but poorly calibrated answers, limiting their reliability on reasoning-intensive tasks. We present Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model's own confidence as an intrinsic reward, mimicking how humans learn in the absence of external feedback. After a frozen LLM generates several chain-of-thought solutions, we compute the confidence of each final answer span and rank the traces accordingly. These synthetic preferences are then used to fine-tune the policy with standard preference optimization, similar to RLHF yet requiring no human labels, gold answers, or externally curated rewards.
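For the preference-optimization step, here is a minimal sketch of the DPO loss on a single synthetic pair, assuming scalar sequence log-probabilities under the current policy and the frozen reference model; `beta` and all function names are illustrative, not the paper's code.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))).

    The loss shrinks as the policy prefers the chosen trace more
    strongly (relative to the reference) than the rejected one."""
    margin = beta * ((logp_chosen - ref_chosen)
                     - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy log-probs: policy already leans toward the chosen trace,
# so the loss falls below the indifference value -log(0.5).
loss = dpo_loss(logp_chosen=-10.0, logp_rejected=-14.0,
                ref_chosen=-12.0, ref_rejected=-13.0)
```

In practice this would be averaged over a batch of the synthetic pairs and back-propagated through the policy only, with the reference model held fixed.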
RLSF simultaneously (i) refines the model's probability estimates, restoring well-behaved calibration, and (ii) strengthens step-by-step reasoning, yielding improved performance on arithmetic reasoning and multiple-choice question answering.
By turning a model's own uncertainty into useful self-feedback, RLSF establishes reinforcement learning on intrinsic model behaviour as a principled and data-efficient component of the LLM post-training pipeline, and warrants further research into intrinsic rewards.
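The calibration improvements described above are commonly quantified with expected calibration error (ECE). A minimal sketch follows; the equal-width binning scheme and bin count are illustrative choices, not details from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width bins.

    confidences: predicted probabilities in [0, 1].
    correct: 0/1 indicators of whether each answer was right."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bin membership: (lo, hi], with 0.0 assigned to the first bin.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

# Toy example: slightly overconfident on correct answers and
# slightly underconfident on wrong ones gives a small but nonzero ECE.
ece = expected_calibration_error([0.95, 0.95, 0.05, 0.05], [1, 1, 0, 0])
```

A well-calibrated model drives this quantity toward zero, which is the behaviour the experiments above report RLSF restoring.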