Efficient Reinforcement Learning with Semantic and Token Entropy for LLM Reasoning

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
In reinforcement learning with verifiable rewards (RLVR), an exclusive pursuit of accuracy often induces entropy collapse, degrading exploration and reasoning performance. To address this, we propose Semantic-Entropy-guided RL (SE-RL): (1) it replaces token-level entropy with semantic entropy, a more principled measure of reasoning uncertainty, to guide curriculum learning that progressively increases reasoning difficulty; (2) it introduces non-uniform token-wise KL regularization, constraining the low-entropy (highly confident) tokens that most affect policy exploration and imposing stronger constraints on the high-covariance portions of those tokens. Evaluated across six reasoning benchmarks with three base-model scales, SE-RL significantly mitigates entropy collapse, improves policy exploration, and achieves better final reasoning performance than existing entropy-regulation methods.
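As a minimal sketch of the data-side idea, the snippet below estimates semantic entropy for one prompt from a set of sampled answers and orders prompts from low to high entropy for the curriculum. The function names and the answer-matching clustering rule are illustrative assumptions, not the paper's exact procedure.

```python
import math
from collections import Counter

def semantic_entropy(sampled_answers):
    """Entropy over semantic clusters of answers sampled for one prompt.

    A cluster is approximated here by the normalized final-answer string;
    the paper may use a stronger semantic-equivalence test (assumption).
    """
    clusters = Counter(a.strip().lower() for a in sampled_answers)
    total = sum(clusters.values())
    return -sum((c / total) * math.log(c / total) for c in clusters.values())

def order_by_semantic_entropy(samples_per_prompt):
    """Curriculum ordering: prompts with low semantic entropy (easier) first."""
    return sorted(samples_per_prompt,
                  key=lambda p: semantic_entropy(samples_per_prompt[p]))

# Example: sampled answers that mostly agree give low semantic entropy.
data = {
    "easy prompt": ["42", "42", "42", "42", "42", "41", "42", "42"],
    "hard prompt": ["7", "12", "7", "9", "12", "3", "7", "8"],
}
print(order_by_semantic_entropy(data))  # ['easy prompt', 'hard prompt']
```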

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has demonstrated superior performance in enhancing the reasoning capability of large language models (LLMs). However, this accuracy-oriented learning paradigm often suffers from entropy collapse, which reduces policy exploration and limits reasoning capabilities. To address this challenge, we propose an efficient reinforcement learning framework that leverages entropy signals at both the semantic and token levels to improve reasoning. From the data perspective, we introduce semantic entropy-guided curriculum learning, organizing training data from low to high semantic entropy to guide progressive optimization from easier to more challenging tasks. For the algorithmic design, we adopt non-uniform token treatment by imposing KL regularization on low-entropy tokens that critically impact policy exploration and applying stronger constraints on high-covariance portions within these tokens. By jointly optimizing data organization and algorithmic design, our method effectively mitigates entropy collapse and enhances LLM reasoning. Experimental results across 6 benchmarks with 3 different parameter-scale base models demonstrate that our method outperforms other entropy-based approaches in improving reasoning.
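A hedged sketch of the algorithm-side idea in PyTorch: KL regularization is applied only to low-entropy (highly confident) tokens, with an extra weight on the high-covariance portion of those tokens. The entropy threshold, the covariance proxy (token log-probability times advantage), and the weighting scheme are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def nonuniform_kl_penalty(policy_logprobs: torch.Tensor,  # [batch, seq]
                          ref_logprobs: torch.Tensor,     # [batch, seq]
                          token_entropy: torch.Tensor,    # [batch, seq]
                          advantage: torch.Tensor,        # [batch, 1]
                          entropy_thresh: float = 0.7,
                          extra_weight: float = 2.0) -> torch.Tensor:
    # Per-token KL estimate between policy and frozen reference (k1 estimator).
    kl = policy_logprobs - ref_logprobs

    # Constrain only low-entropy (highly confident) tokens.
    low_entropy = (token_entropy < entropy_thresh).float()

    # Proxy for "high-covariance" tokens: tokens whose log-prob * advantage
    # product exceeds the sequence mean (assumption, for illustration only).
    cov_proxy = policy_logprobs * advantage
    high_cov = (cov_proxy > cov_proxy.mean(dim=-1, keepdim=True)).float()

    # Stronger constraint on the high-covariance portion of low-entropy tokens.
    weights = low_entropy * (1.0 + extra_weight * high_cov)
    return (weights * kl).mean()

# Toy usage with random tensors standing in for real model outputs.
b, t = 2, 6
pol = -torch.rand(b, t)   # per-token log-probs under the policy
ref = -torch.rand(b, t)   # per-token log-probs under the reference model
ent = torch.rand(b, t)    # per-token entropy of the policy distribution
adv = torch.randn(b, 1)   # sequence-level advantage
print(nonuniform_kl_penalty(pol, ref, ent, adv))
```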
Problem

Research questions and friction points this paper is trying to address.

Mitigates entropy collapse in RLVR for LLMs
Enhances reasoning via semantic and token entropy
Improves policy exploration with curriculum learning and KL regularization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic entropy-guided curriculum learning for progressive optimization
Non-uniform token treatment with KL regularization on low-entropy tokens
Joint optimization of data organization and algorithmic design
Authors

Hongye Cao, Chang'an University (remote sensing)
Zhixin Bai, Harbin Institute of Technology (natural language processing)
Ziyue Peng, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
Boyan Wang, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
Tianpei Yang, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
Jing Huo, Nanjing University (machine learning, computer vision)
Yuyao Zhang, Renmin University of China (artificial intelligence)
Yang Gao, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China