Beyond Token-Level Policy Gradients for Complex Reasoning with Large Language Models

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of conventional token-level policy-gradient methods: they struggle to model semantic decision units that span multiple tokens in complex reasoning tasks, creating a misalignment between the optimization objective and the underlying reasoning structure. To overcome this, the authors propose Multi-token Policy Gradient Optimization (MPO), a framework that extends the action space from individual tokens to contiguous sequences of K tokens, termed semantic blocks, and performs policy optimization at the block level. This granularity better matches the high-level compositional structure of tasks such as mathematical reasoning and code generation. Experiments show that MPO significantly outperforms traditional token-level methods on established mathematical reasoning and code generation benchmarks, validating the advantage of block-level policy optimization in complex reasoning scenarios.

📝 Abstract
Existing policy-gradient methods for auto-regressive language models typically select subsequent tokens one at a time as actions in the policy. While effective for many generation tasks, such an approach may not fully capture the structure of complex reasoning tasks, where a single semantic decision is often realized across multiple tokens (for example, when defining variables or composing equations). This introduces a potential mismatch between token-level optimization and the inherently block-level nature of reasoning in these settings. To bridge this gap, we propose Multi-token Policy Gradient Optimization (MPO), a framework that treats sequences of K consecutive tokens as unified semantic actions. This block-level perspective enables our method to capture the compositional structure of reasoning trajectories and supports optimization over coherent, higher-level objectives. Experiments on mathematical reasoning and coding benchmarks show that MPO outperforms standard token-level policy gradient baselines, highlighting the limitations of token-level policy gradients for complex reasoning and motivating future research to look beyond token-level granularity for reasoning-intensive language tasks.
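
The abstract's core mechanism, treating each run of K consecutive tokens as one action whose log-probability is the sum of its token log-probabilities, can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation; the function name `mpo_block_loss`, the tensor shapes, and the assumption of one precomputed advantage per block are all illustrative.

```python
import torch

def mpo_block_loss(logprobs: torch.Tensor,    # [batch, T] log-probs of sampled tokens
                   advantages: torch.Tensor,  # [batch, T // K] one advantage per block
                   K: int) -> torch.Tensor:
    """REINFORCE-style loss where each action is a block of K tokens.

    Because a block is generated auto-regressively, its log-probability
    is the sum of its token log-probs:
        log pi(block) = sum over i in block of log pi(token_i | prefix).
    """
    batch, T = logprobs.shape
    assert T % K == 0, "sequence length must be a multiple of the block size"
    # Sum token log-probs within each contiguous K-token block.
    block_logprobs = logprobs.view(batch, T // K, K).sum(dim=-1)
    # Policy gradient: maximize advantage-weighted block log-probs.
    return -(block_logprobs * advantages).mean()
```

Note that setting K = 1 recovers the standard token-level policy-gradient loss, so the block size can be read as a knob interpolating between token-level and sequence-level credit assignment.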
Problem

Research questions and friction points this paper is trying to address.

policy gradients
complex reasoning
token-level optimization
block-level structure
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-token Policy Gradient
Block-level Optimization
Complex Reasoning
Large Language Models
Semantic Actions
👥 Authors
Mufan Xu
School of Computer Science and Technology, Harbin Institute of Technology, China
Kehai Chen
Harbin Institute of Technology (Shenzhen)
LLM, Natural Language Processing, Agent, Multi-model Generation
Xuefeng Bai
Harbin Institute of Technology (Shenzhen)
Natural Language Processing, Semantics, Dialogue
Zhengyu Niu
Baidu Inc., Beijing, China
Muyun Yang
School of Computer Science and Technology, Harbin Institute of Technology, China
Tiejun Zhao
School of Computer Science and Technology, Harbin Institute of Technology, China
Min Zhang
Professor of Computer Science, Soochow University
Statistical Machine Translation, Natural Language Processing and Computational Linguistics, Intelligent Computing, Machine Learning