🤖 AI Summary
Large reasoning models trained without supervision tend to rely on spurious majority voting and struggle to sustain self-improvement. Method: This paper proposes RESTRAIN, a self-driven reinforcement learning framework that requires no human annotations. Its core contribution is a self-penalization mechanism built from confidence and consistency signals derived from the model's own chain-of-thought (CoT) answer distribution. This mechanism converts erroneous consensus on unlabeled data into a learning signal, dynamically suppressing overconfident rollouts and low-consistency examples while preserving potentially valid reasoning paths. RESTRAIN integrates distribution-aware CoT selection and calibration into policy optimization algorithms such as GRPO. Results: Experiments show substantial Pass@1 improvements of up to +140.7% on AIME25, +36.2% on MMLU-STEM, and +19.6% on GPQA-Diamond, nearly matching fully supervised baselines and clearly surpassing traditional RL paradigms that depend on gold-standard labels.
📝 Abstract
Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at a high cost in labeled data and falter on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce RESTRAIN (REinforcement learning with Self-restraint), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model's entire answer distribution: penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. The self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it improves Pass@1 by up to +140.7% on AIME25, +36.2% on MMLU-STEM, and +19.6% on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results establish RESTRAIN as a scalable path toward stronger reasoning without gold labels.
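The self-penalization idea described above can be sketched in code. The following is a minimal illustration, not the paper's actual method: the function name, thresholds, and reward shape are all assumptions. It shows the core intuition of replacing a gold-label reward with a pseudo-label derived from the rollout answer distribution, zeroing out low-consistency examples and shrinking the reward from overconfident consensus.

```python
from collections import Counter

def self_penalized_rewards(answers, low_consistency=0.3, overconfidence=0.9):
    """Assign rewards to a group of rollouts for one unlabeled question.

    answers: final answers extracted from each sampled chain of thought.
    The thresholds here are hypothetical, for illustration only.
    """
    counts = Counter(answers)
    majority, majority_count = counts.most_common(1)[0]
    consistency = majority_count / len(answers)  # vote share of the modal answer

    # Low-consistency example: no reliable pseudo-label exists, so zero out
    # the learning signal rather than reinforce a spurious majority.
    if consistency < low_consistency:
        return [0.0] * len(answers)

    rewards = []
    for a in answers:
        r = 1.0 if a == majority else -1.0
        # Overconfident consensus: shrink the positive reward so the policy
        # does not collapse onto a possibly wrong near-unanimous answer.
        if consistency > overconfidence and r > 0:
            r *= 1.0 - consistency
        rewards.append(r)
    return rewards
```

In a GRPO-style setup, these per-rollout rewards would stand in for verifier or gold-label rewards when computing group-relative advantages; the paper's actual confidence and consistency weighting is more involved than this two-threshold sketch.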