Maximizing Confidence Alone Improves Reasoning

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) struggle to improve complex reasoning capabilities without external reward signals or ground-truth supervision. Method: We propose RENT (Reinforcement Learning via Entropy Minimization), a reinforcement learning framework that uses the entropy of the model's own output distribution as an intrinsic reward. RENT replaces handcrafted reward functions and human-annotated answers with self-assessed confidence (i.e., negative output entropy), reinforcing the chains of thought that lead the model to high confidence in its generated answers. Contribution/Results: Evaluated across diverse benchmarks, including GSM8K, MATH500, AMC, AIME, and GPQA, RENT consistently improves mathematical and scientific reasoning performance for models of varying sizes from the Qwen and Mistral families, demonstrating the feasibility and cross-domain generality of purely self-supervised reasoning optimization.
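To make the intrinsic reward concrete, here is a minimal sketch of how a confidence reward could be computed from a model's logits, assuming a PyTorch causal LM. The function name `entropy_reward` and the masking scheme are illustrative assumptions; the paper's exact token weighting (e.g., which generated tokens count toward the reward) may differ.

```python
import torch
import torch.nn.functional as F

def entropy_reward(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Negative mean token entropy over generated tokens (illustrative sketch).

    logits: (seq_len, vocab_size) -- model logits at each generated position.
    mask:   (seq_len,) float -- 1 for tokens that count toward the reward
            (e.g., the answer tokens), 0 elsewhere.
    Returns a scalar reward: higher when the model is more confident.
    """
    log_probs = F.log_softmax(logits, dim=-1)           # (seq_len, vocab)
    probs = log_probs.exp()
    token_entropy = -(probs * log_probs).sum(dim=-1)    # (seq_len,)
    mean_entropy = (token_entropy * mask).sum() / mask.sum().clamp(min=1)
    return -mean_entropy  # minimizing entropy == maximizing confidence
```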

📝 Abstract
Reinforcement learning (RL) has enabled machine learning models to achieve significant advances in many fields. Most recently, RL has empowered frontier language models to solve challenging math, science, and coding problems. However, central to any RL algorithm is the reward function, and reward engineering is a notoriously difficult problem in any domain. In this paper, we propose RENT: Reinforcement Learning via Entropy Minimization -- a fully unsupervised RL method that requires no external reward or ground-truth answers, and instead uses the model's entropy of its underlying distribution as an intrinsic reward. We find that by reinforcing the chains of thought that yield high model confidence on its generated answers, the model improves its reasoning ability. In our experiments, we showcase these improvements on an extensive suite of commonly-used reasoning benchmarks, including GSM8K, MATH500, AMC, AIME, and GPQA, and models of varying sizes from the Qwen and Mistral families. The generality of our unsupervised learning method lends itself to applicability in a wide range of domains where external supervision is unavailable.
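The abstract describes reinforcing confident chains of thought; the sketch below shows one generic way such an update could look as a REINFORCE-style step, reusing the `entropy_reward` helper above. It assumes a Hugging Face-style `model`/`tokenizer` and a `torch.optim` optimizer; these names, the sample count, and the 256-token budget are placeholder assumptions, and the paper's actual policy-optimization algorithm may differ.

```python
import torch

def reinforce_step(model, tokenizer, prompt: str, optimizer, num_samples: int = 4):
    """One REINFORCE-style update using entropy-based rewards (sketch).

    Samples several chains of thought for the prompt, scores each with
    entropy_reward (defined earlier), and reinforces the confident ones.
    Baseline = mean reward across samples (a simple variance reducer).
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]

    rewards, log_prob_sums = [], []
    for _ in range(num_samples):
        with torch.no_grad():
            out = model.generate(**inputs, do_sample=True, max_new_tokens=256)
        gen = out[0, prompt_len:]                          # generated tokens only
        logits = model(out).logits[0, prompt_len - 1:-1]   # logits predicting gen
        mask = torch.ones_like(gen, dtype=torch.float)
        # Detach the reward so gradients flow only through the log-probs.
        rewards.append(entropy_reward(logits, mask).detach())
        log_probs = torch.log_softmax(logits, dim=-1)
        log_prob_sums.append(log_probs.gather(-1, gen[:, None]).sum())

    rewards_t = torch.stack(rewards)
    advantages = rewards_t - rewards_t.mean()              # baseline-subtracted
    loss = -(advantages * torch.stack(log_prob_sums)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Subtracting the mean reward across samples is a standard variance-reduction baseline; with an entropy reward it also means only the relatively most confident completions in each batch get reinforced.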
Problem

Research questions and friction points this paper is trying to address.

How to train reasoning with RL when no external reward or ground-truth answers are available
Whether maximizing the model's own confidence, via entropy minimization, can serve as a useful training signal
How to make RL for reasoning applicable in domains with limited or no external supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully unsupervised RL via entropy minimization, with no external reward or ground-truth answers
Intrinsic reward derived from the model's self-assessed confidence in its generated answers
Consistent reasoning gains across GSM8K, MATH500, AMC, AIME, and GPQA for Qwen and Mistral models of varying sizes
🔎 Similar Papers
No similar papers found.
Mihir Prabhudesai
PhD Student at CMU Robotics
Lili Chen
Carnegie Mellon University
Alex Ippoliti
Carnegie Mellon University
Katerina Fragkiadaki
Associate Professor, Carnegie Mellon University
Computer Vision · Machine Learning · Language Grounding · Robotics
Hao Liu
Carnegie Mellon University
Deepak Pathak
Carnegie Mellon University