StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

📅 2026-04-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing token-based reinforcement learning approaches struggle to model the decision-making process of large language model (LLM) agents in multi-turn interactions, particularly under delayed, sparse rewards and long, highly variable contexts. This position paper proposes StepPO, which formulates a step-level Markov decision process (MDP) and introduces step-level credit assignment, raising the optimization granularity of reinforcement learning from individual tokens to coherent steps so that it matches the granularity of the agent's decisions. The authors also discuss the system design needed to realize step-level Agentic RL in practice, and preliminary experiments provide initial evidence that the approach is effective.

📝 Abstract
General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly stronger agentic capabilities from foundation Large Language Models (LLMs). Agentic Reinforcement Learning (RL) is emerging as a central post-training paradigm for empowering LLMs with these capabilities and is playing an increasingly pivotal role in agent training. Unlike single-turn token-level alignment or reasoning enhancement, as in RLHF and RLVR, Agentic RL targets multi-turn interactive settings, where the goal is to optimize core agentic capabilities such as decision making and tool use while addressing new challenges including delayed and sparse rewards, as well as long and variable contexts. As a result, the token-centric modeling and optimization paradigm inherited from traditional LLM RL is becoming increasingly inadequate for capturing real LLM agent behavior. In this paper, we present StepPO as a position on step-level Agentic RL. We argue that the conventional token-level Markov Decision Process (MDP) should be advanced to a step-level MDP formulation, and that the step, rather than the token, should be regarded as the proper action representation for LLM agents. We then propose step-level credit assignment as the natural optimization counterpart of this formulation, thereby aligning policy optimization and reward propagation with the granularity of agent decisions. Finally, we discuss the key system designs required to realize step-level Agentic RL in practice, and we report preliminary experiments that provide initial evidence for the effectiveness of this perspective. We hope that the step-aligned, step-level paradigm embodied in StepPO offers the Agentic RL community a useful lens for understanding agent behavior and helps advance LLMs toward stronger general-agent capabilities.
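To make the contrast between token-level and step-level credit assignment concrete, here is a minimal sketch in plain Python. It is not the StepPO algorithm, which the abstract does not specify. It assumes one simple reading of the idea: rewards are discounted once per step, and every token inside a step shares that step's advantage. The `Step` container, the function names, the `gamma` discount, and the scalar `baseline` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One agent decision, e.g. a full reasoning or tool-call turn (hypothetical container)."""
    num_tokens: int  # number of tokens the policy generated in this step
    reward: float    # reward observed after the step (often 0 until the final step)


def step_returns(steps: List[Step], gamma: float = 1.0) -> List[float]:
    """Discounted return per step; discounting is applied per step, not per token."""
    returns = [0.0] * len(steps)
    running = 0.0
    # Walk the trajectory backwards so each step accumulates its future rewards.
    for i in range(len(steps) - 1, -1, -1):
        running = steps[i].reward + gamma * running
        returns[i] = running
    return returns


def token_weights(steps: List[Step], gamma: float = 1.0, baseline: float = 0.0) -> List[float]:
    """Broadcast each step's advantage (return minus baseline) to all tokens in that step."""
    weights: List[float] = []
    for step, ret in zip(steps, step_returns(steps, gamma)):
        weights.extend([ret - baseline] * step.num_tokens)
    return weights


# A three-step trajectory with a single delayed reward on the last step.
trajectory = [Step(num_tokens=3, reward=0.0),
              Step(num_tokens=2, reward=0.0),
              Step(num_tokens=4, reward=1.0)]
print(step_returns(trajectory, gamma=0.5))
print(token_weights(trajectory, gamma=0.5))
```

Under this assumed scheme, a step's credit does not depend on how many tokens it contains, so a long tool call and a short one at the same position in the trajectory are weighted equally. A token-level discount would instead shrink credit with every generated token, which is one way the variable context lengths mentioned in the abstract can distort reward propagation.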
Problem

Research questions and friction points this paper is trying to address.

Agentic Reinforcement Learning
step-level MDP
credit assignment
LLM agents
delayed sparse rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

StepPO
Agentic Reinforcement Learning
step-level MDP
credit assignment
Large Language Models
Daoyu Wang
State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
Qingchuan Li
State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
Mingyue Cheng
State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
Jie Ouyang
State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
Shuo Yu
State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
Qi Liu
University of Science and Technology of China
Data Mining, Educational Big Data, Recommender Systems, Social Network Analysis
Enhong Chen
University of Science and Technology of China
data mining, recommender system, machine learning