Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning

📅 2025-10-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reinforcement learning methods struggle to explicitly control the exploration–exploitation trade-off during large language model (LLM) training, particularly in settings with verifiable rewards. To address this, we propose Token Hidden Reward (THR), a fine-grained, token-level influence metric that quantifies each token's contribution to the likelihood of the final correct answer. Leveraging THR, we design a dynamic reweighting mechanism within the Group Relative Policy Optimization (GRPO) framework, enabling explicit and tunable control: suppressing tokens with positive THR and amplifying those with negative THR promotes exploration, while the reverse promotes exploitation. Our approach is compatible with mainstream RL objectives (e.g., GSPO) and supports diverse LLM architectures, including Llama. Experiments on mathematical reasoning benchmarks demonstrate that THR-guided optimization improves greedy-decoding accuracy when biased toward exploitation and Pass@K performance when biased toward exploration, while generalizing well across tasks and models.

📝 Abstract
Reinforcement learning with verifiable rewards has significantly advanced the reasoning capabilities of large language models, yet how to explicitly steer training toward exploration or exploitation remains an open problem. We introduce Token Hidden Reward (THR), a token-level metric that quantifies each token's influence on the likelihood of correct responses under Group Relative Policy Optimization (GRPO). We find that training dynamics are dominated by a small subset of tokens with high absolute THR values. Most interestingly, tokens with positive THR strengthen confidence in correct outputs, thus favoring exploitation, while tokens with negative THR preserve probability mass for alternative outputs, enabling exploration. This insight suggests a natural intervention: a THR-guided reweighting algorithm that modulates GRPO's learning signals to explicitly bias training toward exploitation or exploration. We validate the efficacy of this algorithm on diverse math reasoning benchmarks. By amplifying tokens with positive THR value and weakening negative ones, our algorithm improves greedy-decoding accuracy, favoring exploitation. The reverse strategy yields consistent gains in Pass@K accuracy, favoring exploration. We further demonstrate that our algorithm integrates seamlessly with other RL objectives such as GSPO and generalizes across architectures including Llama. These findings establish THR as a principled and fine-grained mechanism for dynamically controlling exploration and exploitation in RL-tuned LLMs, providing new tools for targeted fine-tuning in reasoning-intensive applications.
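The reweighting strategy described in the abstract can be sketched in a few lines. This is an illustrative assumption, not the paper's implementation: the function name `thr_reweight`, the scaling factor `alpha`, and the simple sign-based rule are all hypothetical, and the computation of THR itself is not shown here.

```python
import numpy as np

def thr_reweight(token_thr, base_weights, alpha=0.5, mode="exploit"):
    """Hypothetical sketch of THR-guided reweighting of GRPO token-level
    learning signals (the paper's exact formula is not reproduced here).

    token_thr:    per-token Token Hidden Reward values; only the sign is used.
    base_weights: per-token learning-signal weights from GRPO (e.g. all ones).
    mode:         "exploit" amplifies positive-THR tokens and suppresses
                  negative ones; "explore" applies the reverse strategy.
    """
    thr = np.asarray(token_thr, dtype=float)
    w = np.asarray(base_weights, dtype=float)
    sign = np.sign(thr)
    if mode == "explore":
        sign = -sign  # reverse strategy: boost negative-THR tokens instead
    # Scale each token's weight up or down by alpha according to THR sign;
    # zero-THR tokens keep their original weight.
    return w * (1.0 + alpha * sign)

# Example: three tokens with THR values [+1, -2, 0] and unit base weights.
print(thr_reweight([1.0, -2.0, 0.0], [1.0, 1.0, 1.0]))          # exploit mode
print(thr_reweight([1.0, -2.0, 0.0], [1.0, 1.0, 1.0], mode="explore"))
```

Under this sketch, exploit mode upweights the positive-THR token to 1.5 and downweights the negative-THR one to 0.5, consistent with the abstract's claim that the training signal can be explicitly biased in either direction by modulating a small set of high-|THR| tokens.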
Problem

Research questions and friction points this paper is trying to address.

Quantifying each token's influence on the likelihood of correct responses during RL training
Steering the exploration–exploitation balance through token-level reward modulation
Improving reasoning accuracy in language models via controlled training dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token Hidden Reward (THR), a token-level metric of each token's influence on correct responses
A THR-guided reweighting algorithm that modulates GRPO's learning signals
Explicit, tunable control over exploration versus exploitation in RL-tuned LLMs