🤖 AI Summary
Large reasoning models trained without supervision tend to rely on spurious majority voting and struggle to sustain self-improvement. Method: This paper proposes RESTRAIN, a self-driven reinforcement learning framework that requires no human annotations. Its core contribution is a self-penalization mechanism built from confidence and consistency signals derived from the model's own chain-of-thought (CoT) answer distribution. This mechanism converts erroneous consensus on unlabeled data into a learning signal, dynamically suppressing overconfident rollouts and low-consistency examples while preserving potentially valid reasoning paths. RESTRAIN integrates distribution-aware CoT selection and calibration into policy optimization algorithms such as GRPO. Results: Experiments show substantial Pass@1 improvements of up to +140.7% on AIME25, +36.2% on MMLU-STEM, and +19.6% on GPQA-Diamond, nearly matching fully supervised baselines and clearly surpassing traditional RL paradigms that depend on gold-standard labels.
📝 Abstract
Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at a high cost in labeled data and falter on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce RESTRAIN (REinforcement learning with Self-restraint), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model's entire answer distribution: penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. The self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it improves Pass@1 by up to +140.7% on AIME25, +36.2% on MMLU-STEM, and +19.6% on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results establish RESTRAIN as a scalable path toward stronger reasoning without gold labels.
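The self-penalization idea described above can be sketched in code. The following is a minimal illustration, not the paper's actual method: the function name, thresholds, and reward shape are all assumptions. It shows the core intuition of replacing a gold-label reward with a pseudo-label derived from the rollout answer distribution, zeroing out low-consistency examples and shrinking the reward from overconfident consensus.

```python
from collections import Counter

def self_penalized_rewards(answers, low_consistency=0.3, overconfidence=0.9):
    """Assign rewards to a group of rollouts for one unlabeled question.

    answers: final answers extracted from each sampled chain of thought.
    The thresholds here are hypothetical, for illustration only.
    """
    counts = Counter(answers)
    majority, majority_count = counts.most_common(1)[0]
    consistency = majority_count / len(answers)  # vote share of the modal answer

    # Low-consistency example: no reliable pseudo-label exists, so zero out
    # the learning signal rather than reinforce a spurious majority.
    if consistency < low_consistency:
        return [0.0] * len(answers)

    rewards = []
    for a in answers:
        r = 1.0 if a == majority else -1.0
        # Overconfident consensus: shrink the positive reward so the policy
        # does not collapse onto a possibly wrong near-unanimous answer.
        if consistency > overconfidence and r > 0:
            r *= 1.0 - consistency
        rewards.append(r)
    return rewards
```

In a GRPO-style setup, these per-rollout rewards would stand in for verifier or gold-label rewards when computing group-relative advantages; the paper's actual confidence and consistency weighting is more involved than this two-threshold sketch.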