🤖 AI Summary
This work addresses a limitation of conventional token-level policy gradient methods: they struggle to model semantic decision units that span multiple tokens in complex reasoning tasks, creating a misalignment between the optimization objective and the underlying reasoning structure. To overcome this, the authors propose the Multi-token Policy Gradient Optimization (MPO) framework, which extends the action space from individual tokens to contiguous sequences of K tokens, termed semantic blocks, and performs policy optimization at this block level. This approach better aligns with the high-level compositional structures inherent in tasks such as mathematical reasoning and code generation. Experimental results demonstrate that MPO significantly outperforms traditional token-level methods on established benchmarks for mathematical reasoning and code generation, validating the efficacy of block-level policy optimization in complex reasoning scenarios.
📝 Abstract
Existing policy-gradient methods for auto-regressive language models typically select subsequent tokens one at a time as actions in the policy. While effective for many generation tasks, such an approach may not fully capture the structure of complex reasoning tasks, where a single semantic decision is often realized across multiple tokens--for example, when defining variables or composing equations. This introduces a potential mismatch between token-level optimization and the inherently block-level nature of reasoning in these settings. To bridge this gap, we propose Multi-token Policy Gradient Optimization (MPO), a framework that treats sequences of K consecutive tokens as unified semantic actions. This block-level perspective enables our method to capture the compositional structure of reasoning trajectories and supports optimization over coherent, higher-level objectives. Experiments on mathematical reasoning and coding benchmarks show that MPO outperforms standard token-level policy gradient baselines, highlighting the limitations of token-level policy gradients for complex reasoning and motivating future research to look beyond token-level granularity for reasoning-intensive language tasks.
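The block-level objective described in the abstract can be sketched roughly as follows. This is an illustrative reading, not the paper's exact formulation: the function name, the choice to form a block's log-probability as the sum of its token log-probabilities (which follows from the auto-regressive chain rule), and the assumption of a precomputed per-block advantage are all our own simplifications.

```python
def mpo_block_loss(token_logps, block_advantages, K=4):
    """Sketch of a block-level (multi-token) policy-gradient loss.

    token_logps:      per-token log-probs of one sampled trajectory,
                      length T, with T assumed divisible by K.
    block_advantages: one advantage estimate per K-token block
                      (length T // K), assumed given by some estimator.
    """
    T = len(token_logps)
    assert T % K == 0, "pad or truncate so T is a multiple of K"
    num_blocks = T // K
    # Joint log-prob of each K-token block: by the chain rule of the
    # auto-regressive factorization, it is the sum of token log-probs.
    block_logps = [sum(token_logps[i * K:(i + 1) * K])
                   for i in range(num_blocks)]
    # REINFORCE-style surrogate at block granularity: weight each
    # block's log-prob by its advantage and negate for minimization.
    return -sum(lp * a for lp, a in zip(block_logps, block_advantages)) / num_blocks
```

Note that with K=1 this reduces to the standard token-level policy-gradient surrogate, which is one way to see the token-level baseline as a special case of the block-level view.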