Efficient Reinforcement Learning with Semantic and Token Entropy for LLM Reasoning

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
In reinforcement learning with verifiable rewards (RLVR), an exclusive pursuit of accuracy often induces entropy collapse, degrading exploration and reasoning performance. To address this, we propose Semantic-Entropy-guided RL (SE-RL): (1) it replaces token-level entropy with semantic entropy, a more principled measure of reasoning uncertainty, to guide curriculum learning that progressively increases reasoning difficulty; (2) it introduces non-uniform token-wise KL regularization, constraining the low-entropy (highly confident) tokens that most affect policy exploration and imposing stronger constraints on the high-covariance portions of those tokens. Evaluated across six reasoning benchmarks with three base-model scales, SE-RL significantly mitigates entropy collapse, improves policy exploration, and achieves better final reasoning performance than existing entropy-regulation methods.
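As a minimal sketch of the data-side idea, the snippet below estimates semantic entropy for one prompt from a set of sampled answers and orders prompts from low to high entropy for the curriculum. The function names and the answer-matching clustering rule are illustrative assumptions, not the paper's exact procedure.

```python
import math
from collections import Counter

def semantic_entropy(sampled_answers):
    """Entropy over semantic clusters of answers sampled for one prompt.

    A cluster is approximated here by the normalized final-answer string;
    the paper may use a stronger semantic-equivalence test (assumption).
    """
    clusters = Counter(a.strip().lower() for a in sampled_answers)
    total = sum(clusters.values())
    return -sum((c / total) * math.log(c / total) for c in clusters.values())

def order_by_semantic_entropy(samples_per_prompt):
    """Curriculum ordering: prompts with low semantic entropy (easier) first."""
    return sorted(samples_per_prompt,
                  key=lambda p: semantic_entropy(samples_per_prompt[p]))

# Example: sampled answers that mostly agree give low semantic entropy.
data = {
    "easy prompt": ["42", "42", "42", "42", "42", "41", "42", "42"],
    "hard prompt": ["7", "12", "7", "9", "12", "3", "7", "8"],
}
print(order_by_semantic_entropy(data))  # ['easy prompt', 'hard prompt']
```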

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has demonstrated superior performance in enhancing the reasoning capability of large language models (LLMs). However, this accuracy-oriented learning paradigm often suffers from entropy collapse, which reduces policy exploration and limits reasoning capabilities. To address this challenge, we propose an efficient reinforcement learning framework that leverages entropy signals at both the semantic and token levels to improve reasoning. From the data perspective, we introduce semantic entropy-guided curriculum learning, organizing training data from low to high semantic entropy to guide progressive optimization from easier to more challenging tasks. For the algorithmic design, we adopt non-uniform token treatment by imposing KL regularization on low-entropy tokens that critically impact policy exploration and applying stronger constraints on high-covariance portions within these tokens. By jointly optimizing data organization and algorithmic design, our method effectively mitigates entropy collapse and enhances LLM reasoning. Experimental results across 6 benchmarks with 3 different parameter-scale base models demonstrate that our method outperforms other entropy-based approaches in improving reasoning.
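A hedged sketch of the algorithm-side idea in PyTorch: KL regularization is applied only to low-entropy (highly confident) tokens, with an extra weight on the high-covariance portion of those tokens. The entropy threshold, the covariance proxy (token log-probability times advantage), and the weighting scheme are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def nonuniform_kl_penalty(policy_logprobs: torch.Tensor,  # [batch, seq]
                          ref_logprobs: torch.Tensor,     # [batch, seq]
                          token_entropy: torch.Tensor,    # [batch, seq]
                          advantage: torch.Tensor,        # [batch, 1]
                          entropy_thresh: float = 0.7,
                          extra_weight: float = 2.0) -> torch.Tensor:
    # Per-token KL estimate between policy and frozen reference (k1 estimator).
    kl = policy_logprobs - ref_logprobs

    # Constrain only low-entropy (highly confident) tokens.
    low_entropy = (token_entropy < entropy_thresh).float()

    # Proxy for "high-covariance" tokens: tokens whose log-prob * advantage
    # product exceeds the sequence mean (assumption, for illustration only).
    cov_proxy = policy_logprobs * advantage
    high_cov = (cov_proxy > cov_proxy.mean(dim=-1, keepdim=True)).float()

    # Stronger constraint on the high-covariance portion of low-entropy tokens.
    weights = low_entropy * (1.0 + extra_weight * high_cov)
    return (weights * kl).mean()

# Toy usage with random tensors standing in for real model outputs.
b, t = 2, 6
pol = -torch.rand(b, t)   # per-token log-probs under the policy
ref = -torch.rand(b, t)   # per-token log-probs under the reference model
ent = torch.rand(b, t)    # per-token entropy of the policy distribution
adv = torch.randn(b, 1)   # sequence-level advantage
print(nonuniform_kl_penalty(pol, ref, ent, adv))
```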
Problem

Research questions and friction points this paper is trying to address.

Mitigates entropy collapse in RLVR for LLMs
Enhances reasoning via semantic and token entropy
Improves policy exploration with curriculum learning and KL regularization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic entropy-guided curriculum learning for progressive optimization
Non-uniform token treatment with KL regularization on low-entropy tokens
Joint optimization of data organization and algorithmic design
Authors

Hongye Cao, Chang'an University (remote sensing)
Zhixin Bai, Harbin Institute of Technology (natural language processing)
Ziyue Peng, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
Boyan Wang, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
Tianpei Yang, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
Jing Huo, Nanjing University (machine learning, computer vision)
Yuyao Zhang, Renmin University of China (artificial intelligence)
Yang Gao, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China