When and Why Does Unsupervised RL Succeed in Mathematical Reasoning? A Manifold Envelopment Perspective

📅 2026-03-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Unsupervised reinforcement learning (RL) often fails in mathematical reasoning due to opaque training dynamics, policy collapse, and reward hacking, with little systematic understanding of the conditions under which it succeeds. This work proposes a concise, deterministic intrinsic reward mechanism integrated with large language models for unsupervised RL and introduces a geometric perspective based on manifold envelopes to analyze training dynamics. We reveal, for the first time, that a model’s prior logical capabilities are the critical determinant of success or failure in unsupervised RL. Furthermore, we establish low-dimensional manifold envelopes as an interpretable geometric criterion for training stability, thereby delineating the effective boundary within which this approach can reliably enhance mathematical reasoning.

πŸ“ Abstract
Although outcome-based reinforcement learning (RL) significantly advances the mathematical reasoning capabilities of Large Language Models (LLMs), its reliance on computationally expensive ground-truth annotations imposes a severe scalability bottleneck. Unsupervised RL guided by intrinsic rewards offers a scalable alternative, yet it suffers from opaque training dynamics and catastrophic instability, such as policy collapse and reward hacking. In this paper, we first design and evaluate a suite of intrinsic rewards that explicitly enforce concise and certain generation. Second, to discover the boundaries of this approach, we test base models across a spectrum of intrinsic reasoning capabilities, revealing how a model's foundational logical prior dictates its success or failure. Finally, to demystify why certain configurations stabilize while others collapse, we introduce a novel geometric diagnostic lens, showing that successful cases are enveloped by manifolds. Ultimately, our work goes beyond merely demonstrating that enforcing concise and certain responses successfully boosts mathematical reasoning; we reveal when this unsupervised approach breaks down and geometrically diagnose why.
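The abstract describes intrinsic rewards that "explicitly enforce concise and certain generation" without ground-truth labels. The paper's exact reward definition is not given here, so the following is only a minimal illustrative sketch, under the assumption that "certainty" is measured by the (negative) entropy of the model's per-token distributions and "conciseness" by a linear length penalty; the function name, weights `alpha` and `beta`, and input format are all hypothetical.

```python
import math

def intrinsic_reward(token_probs, alpha=0.01, beta=1.0):
    """Hypothetical intrinsic reward favoring concise, certain generation.

    token_probs: one probability distribution (list summing to 1) per
    generated token. NOT the paper's actual reward; an assumed sketch.
    """
    if not token_probs:
        return 0.0
    # Mean negative entropy over steps: closer to 0 when the model is certain.
    neg_entropy = sum(
        sum(p * math.log(p) for p in dist if p > 0) for dist in token_probs
    ) / len(token_probs)
    # Linear penalty on response length discourages verbose outputs.
    return beta * neg_entropy - alpha * len(token_probs)

# A short, confident generation scores higher than a long, uncertain one.
confident = [[0.97, 0.01, 0.01, 0.01]] * 5
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 50
```

Such a scalar reward could, in principle, replace the outcome-based reward in a standard RL fine-tuning loop; the abstract's point is that whether this succeeds depends on the base model's logical prior.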
Problem

Research questions and friction points this paper is trying to address.

Unsupervised RL
Mathematical Reasoning
Intrinsic Rewards
Policy Collapse
Reward Hacking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised Reinforcement Learning
Intrinsic Reward
Mathematical Reasoning
Manifold Envelopment
Logical Prior