RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization

📅 2025-10-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large reasoning models in unsupervised settings rely on spurious majority voting and struggle to sustain self-improvement. Method: This paper proposes RESTRAIN, a self-driven reinforcement learning framework that requires no human annotations. Its core idea is to derive confidence and consistency signals from the model's own chain-of-thought (CoT) answer distribution and use them as a self-penalization mechanism: erroneous consensus on unlabeled data becomes a learning signal, overconfident rollouts and low-consistency examples are dynamically down-weighted, and potentially valid reasoning paths are preserved. RESTRAIN integrates this distribution-aware CoT selection and weighting into policy optimization algorithms such as GRPO. Results: Experiments show Pass@1 improvements of up to +140.7% on AIME25, +36.2% on MMLU-STEM, and +19.6% on GPQA-Diamond, nearly matching fully supervised gold-label training without using gold-standard labels.
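The self-penalization idea described above can be illustrated with a minimal sketch. The function below is not from the paper; the threshold, reward values, and function name are illustrative assumptions. It shows the general pattern: score a group of rollouts for one prompt using only the model's own answer distribution, rewarding majority-consistent rollouts when the group is consistent and penalizing the whole group when consensus is weak, rather than blindly reinforcing the majority vote.

```python
from collections import Counter

def self_penalized_rewards(answers, min_consistency=0.4):
    """Sketch of self-penalizing pseudo-rewards for one prompt's rollouts.

    `answers` are final answers extracted from sampled CoT rollouts.
    Rather than trusting the majority vote outright, the full answer
    distribution is used: rollouts agreeing with the majority earn a
    positive reward scaled by group consistency, and when consistency
    falls below `min_consistency` every rollout is penalized so a
    likely-spurious consensus does not dominate training.
    All thresholds and reward magnitudes here are illustrative.
    """
    counts = Counter(answers)
    majority, majority_count = counts.most_common(1)[0]
    consistency = majority_count / len(answers)  # vote share of the top answer

    if consistency < min_consistency:
        # Low-consistency prompt: penalize all rollouts instead of
        # reinforcing a weak, possibly erroneous consensus.
        return [-consistency] * len(answers)

    # Confident enough: reward majority-consistent rollouts, scaled by
    # consistency; minority rollouts get zero rather than a hard penalty,
    # which preserves potentially valid alternative reasoning paths.
    return [consistency if a == majority else 0.0 for a in answers]
```

In a GRPO-style setup, rewards like these would replace gold-label correctness scores when computing group-relative advantages, which is what lets training proceed on unlabeled data.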

📝 Abstract
Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at high costs in labeled data while faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce RESTRAIN (REinforcement learning with Self-restraint), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model's entire answer distribution: penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. The self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it improves Pass@1 by up to +140.7 percent on AIME25, +36.2 percent on MMLU_STEM, and +19.6 percent on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results demonstrate that RESTRAIN establishes a scalable path toward stronger reasoning without gold labels.
Problem

Research questions and friction points this paper is trying to address.

Developing self-penalizing RL for unsupervised reasoning model improvement
Converting absence of gold labels into useful learning signals
Reducing reliance on costly human annotations in reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-penalizing RL framework using unlabeled data
Penalizes overconfident rollouts and low-consistency examples
Integrates self-penalization into policy optimization methods