Rethinking Entropy Regularization in Large Reasoning Models

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address entropy collapse and premature convergence in large reasoning models (LRMs) during reinforcement learning with verifiable rewards (RLVR), failures driven by the vast action space and long trajectories of LRMs, this paper proposes SIREN, a selective entropy regularization method. Its contribution is twofold: (i) a two-step entropy masking mechanism, combining a top-p mask with a peak-entropy mask, that confines exploration to a meaningful subset of actions and states; and (ii) a self-anchored form of the regularizer that stabilizes training over long sequences. Unlike naive global entropy bonuses, SIREN avoids global entropy explosion while preserving exploration diversity and stable convergence. Across five mathematical reasoning benchmarks, SIREN outperforms prior entropy-related RLVR methods, e.g. a +6.6 absolute maj@k gain on AIME24/25 with Qwen2.5-Math-7B, without sacrificing validation pass@k, demonstrating substantially improved reasoning diversity and robustness.

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has shown great promise in enhancing the reasoning abilities of large reasoning models (LRMs). However, it suffers from a critical issue: entropy collapse and premature convergence. Naive entropy regularization, a common approach for encouraging exploration in the traditional RL literature, fails to address this problem in the context of LRM. Our analysis reveals that this failure stems from the vast action space and long trajectories in LRMs, which easily trigger a global entropy explosion as the model indiscriminately explores all possible actions and states. To address this, we propose SIREN (SelectIve entRopy rEgularizatioN), a method that confines exploration to a meaningful subset of actions and states. SIREN achieves this through a two-step entropy masking mechanism, consisting of a top-p mask and a peak-entropy mask. In addition, regularization is transformed into a self-anchored form to stabilize training. Across five mathematical benchmarks, SIREN attains superior average performance over previous entropy-related RLVR approaches, exemplified by a +6.6 maj@k improvement on AIME24/25 with Qwen2.5-Math-7B. Further analysis confirms that SIREN promotes greater response diversity and maintains entropy at an appropriate level, which helps to preserve the validation pass@k throughout training. This effectively mitigates the premature convergence problem common in RLVR for LRM.
Problem

Research questions and friction points this paper is trying to address.

Addressing entropy collapse in large reasoning models
Preventing premature convergence during reinforcement learning
Controlling global entropy explosion in vast action spaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective entropy regularization for large reasoning models
Two-step entropy masking mechanism for exploration
Self-anchored regularization to stabilize training
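The three ideas above can be sketched roughly as follows. This is an illustrative interpretation only: the function name, the default thresholds, the quantile-based peak-entropy rule, and the mean-entropy anchor are all assumptions for exposition, not SIREN's exact formulation.

```python
import numpy as np

def selective_entropy_reg(logits, top_p=0.9, peak_quantile=0.8, anchor=None):
    """Sketch of SIREN-style selective entropy regularization (assumed form).

    logits: array of shape (T, V) -- per-token logits over the vocabulary.
    Returns a scalar regularization term.
    """
    # Token-level softmax (numerically stabilized).
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)

    # (1) Top-p mask: measure entropy only over each token's nucleus,
    # confining the "exploration space" to high-probability actions.
    order = np.argsort(-p, axis=-1)
    sorted_p = np.take_along_axis(p, order, axis=-1)
    cum = np.cumsum(sorted_p, axis=-1)
    keep_sorted = (cum - sorted_p) < top_p        # tokens inside the nucleus
    keep = np.zeros_like(keep_sorted)
    np.put_along_axis(keep, order, keep_sorted, axis=-1)

    masked_p = np.where(keep, p, 0.0)
    masked_p /= masked_p.sum(axis=-1, keepdims=True)
    tok_entropy = -(masked_p * np.log(masked_p + 1e-12)).sum(axis=-1)  # (T,)

    # (2) Peak-entropy mask: regularize only the highest-entropy positions
    # (here chosen by quantile -- an assumption), ignoring near-deterministic ones.
    thresh = np.quantile(tok_entropy, peak_quantile)
    pos_mask = tok_entropy >= thresh

    # (3) Self-anchored form: instead of maximizing entropy globally, pull the
    # masked entropy toward an anchor value (e.g. entropy under an earlier
    # policy snapshot; the mean used here is a placeholder assumption).
    if anchor is None:
        anchor = tok_entropy.mean()
    reg = ((tok_entropy - anchor) ** 2 * pos_mask).sum() / max(pos_mask.sum(), 1)
    return reg
```

A uniform policy yields zero regularization under this sketch (every position's masked entropy already equals the anchor), while a policy with uneven per-token entropy is pulled toward the anchor only at its high-entropy positions, which is the selective behavior the summary describes.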