🤖 AI Summary
Large language models (LLMs) often produce superficially plausible yet poorly calibrated responses, undermining their reliability on complex reasoning tasks. To address this, we propose a self-supervised reinforcement-learning post-training framework that requires no external annotations. A frozen LLM generates multiple chain-of-thought (CoT) trajectories; internal confidence scores, derived from token-level log-probabilities over the final answer span, are used to rank the trajectories and construct synthetic preference data, which then drives policy refinement via preference-optimization algorithms such as DPO. The approach is the first to treat the model's intrinsic confidence as an optimizable, self-generated reward signal, eliminating dependence on human annotations, ground-truth answers, or external reward models. Experiments demonstrate substantial improvements in probabilistic calibration and consistent performance gains on arithmetic-reasoning and multiple-choice benchmarks, validating the effectiveness and high data efficiency of intrinsic-reward-driven post-training.
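The confidence-ranking step can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: the confidence definition (mean token log-probability over the final-answer span) and the pairing scheme (most confident trace against each weaker one) are assumptions chosen for clarity.

```python
# Sketch: rank CoT traces by answer-span confidence and build
# synthetic (chosen, rejected) preference pairs. Names and the exact
# confidence definition are illustrative assumptions.

def answer_confidence(token_logprobs):
    """Mean log-probability over the final-answer tokens."""
    return sum(token_logprobs) / len(token_logprobs)

def build_preference_pairs(traces):
    """traces: list of (text, answer_token_logprobs).

    Pairs the highest-confidence trace (chosen) against every
    lower-confidence trace (rejected)."""
    ranked = sorted(traces, key=lambda t: answer_confidence(t[1]),
                    reverse=True)
    chosen_text = ranked[0][0]
    return [(chosen_text, text) for text, _ in ranked[1:]]

# Toy traces: hypothetical answer spans with per-token log-probs.
traces = [
    ("... answer: 42", [-0.1, -0.2]),   # high confidence
    ("... answer: 40", [-1.5, -2.0]),   # low confidence
    ("... answer: 41", [-0.8, -0.9]),   # medium confidence
]
pairs = build_preference_pairs(traces)
```

Each resulting pair can then be fed directly to a preference-optimization trainer as a (chosen, rejected) example.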
📝 Abstract
Large Language Models (LLMs) often produce plausible but poorly calibrated answers, limiting their reliability on reasoning-intensive tasks. We present Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model's own confidence as an intrinsic reward, mimicking how humans learn in the absence of external feedback. After a frozen LLM generates several chain-of-thought solutions, we compute the confidence of each final answer span and rank the traces accordingly. These synthetic preferences are then used to fine-tune the policy with standard preference optimization, similar to RLHF yet requiring no human labels, gold answers, or externally curated rewards.
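For the preference-optimization step, here is a minimal sketch of the DPO loss on a single synthetic pair, assuming scalar sequence log-probabilities under the current policy and the frozen reference model; `beta` and all function names are illustrative, not the paper's code.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))).

    The loss shrinks as the policy prefers the chosen trace more
    strongly (relative to the reference) than the rejected one."""
    margin = beta * ((logp_chosen - ref_chosen)
                     - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy log-probs: policy already leans toward the chosen trace,
# so the loss falls below the indifference value -log(0.5).
loss = dpo_loss(logp_chosen=-10.0, logp_rejected=-14.0,
                ref_chosen=-12.0, ref_rejected=-13.0)
```

In practice this would be averaged over a batch of the synthetic pairs and back-propagated through the policy only, with the reference model held fixed.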
RLSF simultaneously (i) refines the model's probability estimates, restoring well-behaved calibration, and (ii) strengthens step-by-step reasoning, yielding improved performance on arithmetic reasoning and multiple-choice question answering.
By turning a model's own uncertainty into useful self-feedback, RLSF establishes reinforcement learning on intrinsic model behaviour as a principled and data-efficient component of the LLM post-training pipeline, and warrants further research into intrinsic rewards.
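The calibration improvements described above are commonly quantified with expected calibration error (ECE). A minimal sketch follows; the equal-width binning scheme and bin count are illustrative choices, not details from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width bins.

    confidences: predicted probabilities in [0, 1].
    correct: 0/1 indicators of whether each answer was right."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bin membership: (lo, hi], with 0.0 assigned to the first bin.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

# Toy example: slightly overconfident on correct answers and
# slightly underconfident on wrong ones gives a small but nonzero ECE.
ece = expected_calibration_error([0.95, 0.95, 0.05, 0.05], [1, 1, 0, 0])
```

A well-calibrated model drives this quantity toward zero, which is the behaviour the experiments above report RLSF restoring.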