STEP: Success-Rate-Aware Trajectory-Efficient Policy Optimization

📅 2025-11-17
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Online reinforcement learning suffers from low sampling efficiency, erroneous penalization of correct actions in failed trajectories, and high sample costs in multi-turn interactions. To address these challenges, we propose STEP, a novel framework featuring three key innovations: (1) a task-level success-rate-guided adaptive resampling mechanism that dynamically prioritizes harder tasks; (2) a step-level advantage function weighted by task success rates, enabling fine-grained policy optimization; and (3) integrated trajectory decomposition, step-level GRPO enhancement, and smoothed success-rate tracking. Experiments on OSWorld and AndroidWorld demonstrate that STEP significantly outperforms trajectory-level GRPO, achieving higher sample efficiency, improved training stability, faster convergence, and stronger generalization across diverse tasks.
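
To make the resampling mechanism concrete, here is a minimal Python sketch of smoothed success-rate tracking with difficulty-proportional allocation. The class name, the EMA decay, and the `1 - success_rate` weighting rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

class SuccessRateTracker:
    """Hypothetical tracker; the EMA decay and allocation rule are assumptions."""

    def __init__(self, num_tasks: int, alpha: float = 0.9, init_rate: float = 0.5):
        self.alpha = alpha
        # Smoothed per-task success rates, initialized to a neutral value.
        self.rates = np.full(num_tasks, init_rate)

    def update(self, task_id: int, succeeded: bool) -> None:
        # Exponential moving average keeps the estimate stable under noisy rollouts.
        self.rates[task_id] = (
            self.alpha * self.rates[task_id] + (1 - self.alpha) * float(succeeded)
        )

    def sampling_weights(self) -> np.ndarray:
        # Allocate more rollouts to harder (low-success-rate) tasks.
        w = 1.0 - self.rates
        w = np.clip(w, 1e-3, None)  # keep every task reachable
        return w / w.sum()

tracker = SuccessRateTracker(num_tasks=4)
tracker.update(task_id=0, succeeded=True)
task_ids = np.random.choice(4, size=8, p=tracker.sampling_weights())
```

The clip floor keeps nearly solved tasks in rotation, so the smoothed estimate can still react if the policy regresses on them.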

๐Ÿ“ Abstract
Multi-turn interaction remains challenging for online reinforcement learning. A common solution is trajectory-level optimization, which treats each trajectory as a single training sample. However, this approach can be inefficient and yield misleading learning signals: it applies uniform sampling across tasks regardless of difficulty, penalizes correct intermediate actions in failed trajectories, and incurs high sample-collection costs. To address these issues, we propose STEP (Success-rate-aware Trajectory-Efficient Policy optimization), a framework that dynamically allocates sampling based on per-task success rates and performs step-level optimization. STEP maintains a smoothed success-rate record to guide adaptive trajectory resampling, allocating more effort to harder tasks. It then computes success-rate-weighted advantages and decomposes trajectories into step-level samples. Finally, it applies a step-level GRPO augmentation to refine updates for low-success tasks. Experiments on OSWorld and AndroidWorld show that STEP substantially improves sample efficiency and training stability over trajectory-level GRPO, converging faster and generalizing better under the same sampling budget.
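
The trajectory decomposition and success-rate-weighted advantage can be pictured as below; a minimal sketch assuming a simple symmetric weighting (`1 - success_rate` for successes, `success_rate` for failures) that softens the penalty on steps from failed trajectories of hard tasks. The paper's actual advantage function may differ.

```python
from dataclasses import dataclass

@dataclass
class StepSample:
    state: object
    action: object
    advantage: float

def decompose(trajectory, succeeded: bool, success_rate: float) -> list[StepSample]:
    # Trajectory-level GRPO assigns one learning signal per trajectory;
    # here every (state, action) step becomes its own training sample.
    base = 1.0 if succeeded else -1.0
    # Assumed weighting: rare successes on hard tasks get large positive
    # advantages, while failures on hard tasks are penalized only lightly,
    # since their intermediate actions may still have been correct.
    weight = (1.0 - success_rate) if succeeded else success_rate
    return [StepSample(s, a, base * weight) for s, a in trajectory]

# A failed rollout on a hard task (20% success) yields a mild penalty of -0.2 per step.
steps = decompose([("s0", "a0"), ("s1", "a1")], succeeded=False, success_rate=0.2)
```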
Problem

Research questions and friction points this paper is trying to address.

Uniform trajectory sampling ignores task difficulty, wasting rollouts on tasks the policy already solves
Trajectory-level optimization penalizes correct intermediate actions whenever the overall trajectory fails
Multi-turn interaction makes sample collection costly, hurting sample efficiency and training stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic sampling allocation driven by per-task success rates
Step-level optimization with success-rate-weighted advantages
Step-level GRPO augmentation that refines updates for low-success tasks (see the sketch below)
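
As referenced in the list above, here is one way the step-level GRPO augmentation could look; the group-relative normalization follows standard GRPO, while the `1 + (1 - success_rate)` boost for low-success tasks is an illustrative assumption.

```python
import numpy as np

def step_level_grpo_advantages(step_rewards: np.ndarray, success_rate: float) -> np.ndarray:
    # Standard GRPO step: normalize each reward against its group's baseline.
    mean, std = step_rewards.mean(), step_rewards.std() + 1e-8
    adv = (step_rewards - mean) / std
    # Assumed augmentation: amplify updates for tasks the policy still fails often.
    return adv * (1.0 + (1.0 - success_rate))

# A group of four step rewards from a task with a 25% smoothed success rate.
adv = step_level_grpo_advantages(np.array([1.0, 0.0, 0.0, 1.0]), success_rate=0.25)
```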
🔎 Similar Papers
No similar papers found.
Yuhan Chen
MiLM Plus, Xiaomi Inc.
Yuxuan Liu
Renmin University of China
Long Zhang
Wuhan University
Pengzhi Gao
Xiaomi LLM Team
Machine Learning, Natural Language Processing, High Dimensional Data, Signal Processing
Jian Luan
Toshiba, Microsoft, Xiaomi
LLM, VLM, TTS, Singing Synthesis
Wei Liu
MiLM Plus, Xiaomi Inc.