Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL

📅 2026-02-13
📈 Citations: 0
Influential: 0

📝 Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) trains large language models (LLMs) from sampled trajectories, making decoding strategy a core component of learning rather than a purely inference-time choice. Sampling temperature directly controls the exploration-exploitation trade-off by modulating policy entropy, yet existing methods rely on static values or heuristic adaptations that are decoupled from task-level rewards. We propose Introspective LLM, a hierarchical reinforcement learning framework that learns to control sampling temperature during generation. At each decoding step, the model selects a temperature based on its hidden state and samples the next token from the resulting distribution. Temperature and token policies are jointly optimized from downstream rewards using a coordinate ascent scheme. Experiments on mathematical reasoning benchmarks show that learned temperature policies outperform fixed and heuristic baselines, while exhibiting interpretable exploration behaviors aligned with reasoning uncertainty.
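
The per-step procedure described in the abstract (pick a temperature from the hidden state, then sample a token from the temperature-scaled distribution) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the sigmoid-gated linear head, the parameters `w` and `b`, the bounds `t_min` and `t_max`, and all function names are assumptions made for illustration, and the paper's actual temperature policy may take a different form (for example, a choice among discrete values).

```python
import numpy as np

def softmax(logits, temperature):
    """Temperature-scaled softmax; higher temperature flattens the distribution."""
    z = logits / temperature
    z = z - z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def select_temperature(hidden, w, b, t_min=0.3, t_max=1.5):
    """Hypothetical temperature head: maps a hidden state into [t_min, t_max]."""
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w + b)))  # sigmoid output in (0, 1)
    return t_min + (t_max - t_min) * gate

def decode_step(hidden, logits, w, b, rng):
    """One decoding step: choose a temperature, then sample the next token."""
    temperature = select_temperature(hidden, w, b)
    probs = softmax(logits, temperature)
    token = rng.choice(len(probs), p=probs)
    return int(token), float(temperature), probs

def entropy(p):
    """Shannon entropy in nats, used here as a proxy for exploration."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())
```

In this sketch, a higher selected temperature yields a higher-entropy token distribution for the same logits, which is the mechanism through which temperature modulates exploration. Training the temperature head from downstream rewards is not shown.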
Problem

Research questions and friction points this paper is trying to address.

sampling temperature
exploration-exploitation trade-off
large language models
reinforcement learning
decoding strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Reinforcement Learning
Sampling Temperature
LLM Internal States
Reinforcement Learning from Verifiable Rewards
Adaptive Exploration