Reinforcement Learning with Promising Tokens for Large Language Models

๐Ÿ“… 2026-02-03
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
This work addresses the challenge of excessively large action spaces in large language models (LLMs) for reinforcement learning, where a vast number of context-irrelevant tokens hinder policy optimization. To mitigate this, the authors propose the RLPT framework, which leverages semantic priors to dynamically identify "promising tokens" and restricts the policy to low-rank subspaces corresponding to effective reasoning paths. A dynamic masking mechanism further decouples the decision-making process from token generation. The approach is compatible with existing algorithms such as GRPO and DAPO, and demonstrates significant improvements in training stability and sample efficiency across mathematical reasoning, programming, and communication tasks. Moreover, RLPT scales effectively across model sizes, including 4B and 8B parameter variants, without compromising performance.

๐Ÿ“ Abstract
Reinforcement learning (RL) has emerged as a key paradigm for aligning and optimizing large language models (LLMs). Standard approaches treat the LLM as the policy and apply RL directly over the full vocabulary space. However, this formulation includes the massive tail of contextually irrelevant tokens in the action space, which could distract the policy from focusing on decision-making among the truly reasonable tokens. In this work, we verify that valid reasoning paths could inherently concentrate within a low-rank subspace. Based on this insight, we introduce Reinforcement Learning with Promising Tokens (RLPT), a framework that mitigates the action space issue by decoupling strategic decision-making from token generation. Specifically, RLPT leverages the semantic priors of the base model to identify a dynamic set of "promising tokens" and constrains policy optimization exclusively to this refined subset via masking. Theoretical analysis and empirical results demonstrate that RLPT effectively reduces gradient variance, stabilizes the training process, and improves sample efficiency. Experiment results on math, coding, and telecom reasoning show that RLPT outperforms standard RL baselines and integrates effectively across various model sizes (4B and 8B) and RL algorithms (GRPO and DAPO).
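The masking idea the abstract describes can be sketched in a few lines. This is a minimal illustrative reconstruction, not the paper's implementation: the function names, the nucleus-style top-p cut used to stand in for "promising token" identification, and all parameter values are assumptions.

```python
import numpy as np

def promising_token_mask(base_logits, top_p=0.9):
    # Hypothetical stand-in for RLPT's promising-token selection:
    # keep the smallest set of tokens whose base-model probability
    # mass reaches top_p (a nucleus-style cut over semantic priors).
    probs = np.exp(base_logits - base_logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, top_p)) + 1]
    mask = np.full_like(base_logits, -np.inf)
    mask[keep] = 0.0          # 0 = promising, -inf = excluded
    return mask

def masked_policy_probs(policy_logits, mask):
    # Policy optimization is restricted to the promising subset:
    # masked tokens get exactly zero probability, so no gradient
    # signal is spent on the irrelevant tail of the vocabulary.
    z = policy_logits + mask
    z = z - np.max(z[np.isfinite(z)])   # stabilize before exp
    p = np.exp(z)                       # exp(-inf) -> 0 for masked tokens
    return p / p.sum()
```

Note how this decouples decision-making from generation: even if the policy assigns a large logit to a token outside the promising set, that token receives zero probability, so the policy only competes among tokens the base model's priors deem contextually plausible.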
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Large Language Models
Action Space
Token Generation
Policy Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning
Large Language Models
Promising Tokens
Action Space Reduction
Sample Efficiency
Jing-Cheng Pang
Researcher, Huawei; Nanjing University
reinforcement learning, language-conditioned RL, large language model
Liang Lu
Huawei Technologies Co., Ltd.
Xian Tang
Huawei Technologies Co., Ltd.
Kun Jiang
Tsinghua University
autonomous driving
Sijie Wu
Huawei Technologies Co., Ltd.
Kai Zhang
Huawei Technologies Co., Ltd.
Xubin Li
Huawei Technologies Co., Ltd.