StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing token-level reinforcement learning approaches struggle to model how large language model (LLM) agents make decisions in multi-turn interactions, particularly when rewards are delayed and sparse and contexts are long and variable in length. This position paper proposes StepPO, which formulates a step-level Markov decision process (MDP) in which the step, rather than the token, is the action, and pairs it with step-level credit assignment so that policy optimization and reward propagation match the granularity of agent decisions. The paper also discusses the system design needed to realize step-level Agentic RL in practice, and preliminary experiments provide initial evidence for the effectiveness of this perspective.

📝 Abstract

General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly stronger agentic capabilities from foundation Large Language Models (LLMs). Agentic Reinforcement Learning (RL) is emerging as a central post-training paradigm for empowering LLMs with these capabilities and is playing an increasingly pivotal role in agent training. Unlike single-turn token-level alignment or reasoning enhancement, as in RLHF and RLVR, Agentic RL targets multi-turn interactive settings, where the goal is to optimize core agentic capabilities such as decision making and tool use while addressing new challenges including delayed and sparse rewards, as well as long and variable context. As a result, the token-centric modeling and optimization paradigm inherited from traditional LLM RL is becoming increasingly inadequate for capturing real LLM agent behavior. In this paper, we present StepPO as a position on step-level Agentic RL. We argue that the conventional token-level Markov Decision Process (MDP) should be advanced to a step-level MDP formulation, and that the step, rather than the token, should be regarded as the proper action representation for LLM agents. We then propose step-level credit assignment as the natural optimization counterpart of this formulation, thereby aligning policy optimization and reward propagation with the granularity of agent decisions. Finally, we discuss the key system designs required to realize step-level Agentic RL in practice, and preliminary experiments provide initial evidence for the effectiveness of this perspective. We hope that the step-aligned, step-level paradigm embodied in StepPO offers the Agentic RL community a useful lens for understanding agent behavior and helps advance LLMs toward stronger general-agent capabilities.
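To make the idea of step-level credit assignment concrete, here is a minimal Python sketch. It is an illustration only and not the StepPO algorithm: the abstract gives no formulas, so the `Step` structure, the discount factor `gamma`, and the choice to copy each step's return onto all of its tokens are assumptions made for this example. It treats each agent turn as one action, discounts a sparse terminal reward once per step, and then gives every token in a step the same credit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One agent decision (hypothetical structure): all tokens emitted in one turn."""
    num_tokens: int      # how many tokens the policy generated in this step
    reward: float = 0.0  # reward observed after the step (often 0 until the end)


def step_returns(steps: List[Step], gamma: float) -> List[float]:
    """Discounted return per step; the discount is applied once per step, not per token."""
    returns = [0.0] * len(steps)
    running = 0.0
    for i in reversed(range(len(steps))):
        running = steps[i].reward + gamma * running
        returns[i] = running
    return returns


def broadcast_to_tokens(steps: List[Step], per_step: List[float]) -> List[float]:
    """Assign every token in a step the same step-level credit (an assumed choice)."""
    token_credit: List[float] = []
    for step, value in zip(steps, per_step):
        token_credit.extend([value] * step.num_tokens)
    return token_credit


# A three-step trajectory with a sparse reward that arrives only at the end.
trajectory = [Step(num_tokens=3), Step(num_tokens=5), Step(num_tokens=2, reward=1.0)]
returns = step_returns(trajectory, gamma=0.5)
token_credit = broadcast_to_tokens(trajectory, returns)
```

In this sketch the credit a step receives depends on how many decisions separate it from the reward, not on how many tokens the intervening steps happened to contain, which is one way to read the abstract's point about long and variable context.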

Problem

Research questions and friction points this paper is trying to address.

Agentic Reinforcement Learning

step-level MDP

credit assignment

LLM agents

delayed sparse rewards

Innovation

Methods, ideas, or system contributions that make the work stand out.

StepPO

Agentic Reinforcement Learning

step-level MDP