Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR

📅 2025-10-09

📈 Citations: 0

✨ Influential: 0

career value

163K/year

🤖 AI Summary

To address the core challenges of insufficient exploration and the difficulty in balancing reasoning diversity with factual accuracy in large language models (LLMs) under reinforcement learning–based inference, this paper proposes a token-level multi-temperature sampling mechanism. It dynamically classifies tokens into high-entropy (reasoning-oriented) and low-entropy (knowledge-oriented) categories and applies correspondingly higher or lower temperature values. Crucially, temperature scheduling is performed at the rollout level, enabling fine-grained, adaptive generation control. This is the first approach to decouple temperature parameters along the token-type dimension—requiring no additional training or architectural modifications. Evaluated across multiple reasoning benchmarks—including GSM8K, MMLU, and HotpotQA—the method achieves significant improvements in both accuracy and inference stability. Empirical results demonstrate that differentiated temperature strategies effectively enhance exploratory capability while preserving factual consistency.

Technology Category

Application Category

📝 Abstract

Reinforcement Learning has demonstrated substantial improvements in the reasoning abilities of Large Language Models (LLMs), exhibiting significant applicability across various domains. Recent research has identified that tokens within LLMs play distinct roles during reasoning tasks, categorizing them into high-entropy reasoning tokens and low-entropy knowledge tokens. Prior approaches have typically focused on restricting updates to indirectly encourage exploration, yet they do not explicitly facilitate exploratory behavior during the token generation stage itself. In this work, we introduce a complementary approach that explicitly promotes exploration during sampling by applying distinct temperature settings for different token types. Specifically, our method employs higher temperatures for reasoning tokens to actively encourage exploration, while retaining lower temperatures for knowledge tokens to maintain factual correctness. Furthermore, we systematically investigate various multi-temperature scheduling strategies and their impacts within reinforcement learning contexts. Empirical evaluations on several reasoning benchmarks demonstrate that our approach significantly enhances the reasoning performance of LLMs. The code is available at https://github.com/zhmzm/Multi_Temperature_Verl.git.

Problem

Research questions and friction points this paper is trying to address.

Explores multi-temperature strategies for token-level control in RL

Enhances reasoning tokens exploration while preserving knowledge tokens accuracy

Investigates temperature scheduling impacts on LLM reasoning performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-temperature sampling for distinct token types

Higher temperature for reasoning tokens promotes exploration

Lower temperature for knowledge tokens maintains accuracy

🔎 Similar Papers

Don't flatten, tokenize! Unlocking the key to SoftMoE's efficacy in deep RL