STEP: Success-Rate-Aware Trajectory-Efficient Policy Optimization

📅 2025-11-17
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Online reinforcement learning suffers from low sampling efficiency, erroneous penalization of correct actions in failed trajectories, and high sample costs in multi-turn interactions. To address these challenges, we propose STEP, a novel framework featuring three key innovations: (1) a task-level success-rate-guided adaptive resampling mechanism that dynamically prioritizes harder tasks; (2) a step-level advantage function weighted by task success rates, enabling fine-grained policy optimization; and (3) integrated trajectory decomposition, step-level GRPO enhancement, and smoothed success-rate tracking. Experiments on OSWorld and AndroidWorld demonstrate that STEP significantly outperforms trajectory-level GRPO, achieving higher sample efficiency, improved training stability, faster convergence, and stronger generalization across diverse tasks.
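
To make the resampling mechanism concrete, here is a minimal Python sketch of smoothed success-rate tracking with difficulty-proportional allocation. The class name, the EMA decay, and the `1 - success_rate` weighting rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

class SuccessRateTracker:
    """Hypothetical tracker; the EMA decay and allocation rule are assumptions."""

    def __init__(self, num_tasks: int, alpha: float = 0.9, init_rate: float = 0.5):
        self.alpha = alpha
        # Smoothed per-task success rates, initialized to a neutral value.
        self.rates = np.full(num_tasks, init_rate)

    def update(self, task_id: int, succeeded: bool) -> None:
        # Exponential moving average keeps the estimate stable under noisy rollouts.
        self.rates[task_id] = (
            self.alpha * self.rates[task_id] + (1 - self.alpha) * float(succeeded)
        )

    def sampling_weights(self) -> np.ndarray:
        # Allocate more rollouts to harder (low-success-rate) tasks.
        w = 1.0 - self.rates
        w = np.clip(w, 1e-3, None)  # keep every task reachable
        return w / w.sum()

tracker = SuccessRateTracker(num_tasks=4)
tracker.update(task_id=0, succeeded=True)
task_ids = np.random.choice(4, size=8, p=tracker.sampling_weights())
```

The clip floor keeps nearly solved tasks in rotation, so the smoothed estimate can still react if the policy regresses on them.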

๐Ÿ“ Abstract
Multi-turn interaction remains challenging for online reinforcement learning. A common solution is trajectory-level optimization, which treats each trajectory as a single training sample. However, this approach can be inefficient and yield misleading learning signals: it applies uniform sampling across tasks regardless of difficulty, penalizes correct intermediate actions in failed trajectories, and incurs high sample-collection costs. To address these issues, we propose STEP (Success-rate-aware Trajectory-Efficient Policy optimization), a framework that dynamically allocates sampling based on per-task success rates and performs step-level optimization. STEP maintains a smoothed success-rate record to guide adaptive trajectory resampling, allocating more effort to harder tasks. It then computes success-rate-weighted advantages and decomposes trajectories into step-level samples. Finally, it applies a step-level GRPO augmentation to refine updates for low-success tasks. Experiments on OSWorld and AndroidWorld show that STEP substantially improves sample efficiency and training stability over trajectory-level GRPO, converging faster and generalizing better under the same sampling budget.
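
The trajectory decomposition and success-rate-weighted advantage can be pictured as below; a minimal sketch assuming a simple symmetric weighting (`1 - success_rate` for successes, `success_rate` for failures) that softens the penalty on steps from failed trajectories of hard tasks. The paper's actual advantage function may differ.

```python
from dataclasses import dataclass

@dataclass
class StepSample:
    state: object
    action: object
    advantage: float

def decompose(trajectory, succeeded: bool, success_rate: float) -> list[StepSample]:
    # Trajectory-level GRPO assigns one learning signal per trajectory;
    # here every (state, action) step becomes its own training sample.
    base = 1.0 if succeeded else -1.0
    # Assumed weighting: rare successes on hard tasks get large positive
    # advantages, while failures on hard tasks are penalized only lightly,
    # since their intermediate actions may still have been correct.
    weight = (1.0 - success_rate) if succeeded else success_rate
    return [StepSample(s, a, base * weight) for s, a in trajectory]

# A failed rollout on a hard task (20% success) yields a mild penalty of -0.2 per step.
steps = decompose([("s0", "a0"), ("s1", "a1")], succeeded=False, success_rate=0.2)
```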
Problem

Research questions and friction points this paper is trying to address.

Uniform trajectory sampling ignores task difficulty, wasting rollouts on tasks the policy already solves
Trajectory-level optimization penalizes correct intermediate actions whenever the overall trajectory fails
Multi-turn interaction makes sample collection costly, hurting sample efficiency and training stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic sampling allocation driven by per-task success rates
Step-level optimization with success-rate-weighted advantages
Step-level GRPO augmentation that refines updates for low-success tasks (see the sketch below)
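
As referenced in the list above, here is one way the step-level GRPO augmentation could look; the group-relative normalization follows standard GRPO, while the `1 + (1 - success_rate)` boost for low-success tasks is an illustrative assumption.

```python
import numpy as np

def step_level_grpo_advantages(step_rewards: np.ndarray, success_rate: float) -> np.ndarray:
    # Standard GRPO step: normalize each reward against its group's baseline.
    mean, std = step_rewards.mean(), step_rewards.std() + 1e-8
    adv = (step_rewards - mean) / std
    # Assumed augmentation: amplify updates for tasks the policy still fails often.
    return adv * (1.0 + (1.0 - success_rate))

# A group of four step rewards from a task with a 25% smoothed success rate.
adv = step_level_grpo_advantages(np.array([1.0, 0.0, 0.0, 1.0]), success_rate=0.25)
```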
🔎 Similar Papers
No similar papers found.
Yuhan Chen
MiLM Plus, Xiaomi Inc.
Yuxuan Liu
Renmin University of China
Long Zhang
Wuhan University
Pengzhi Gao
Xiaomi LLM Team
Machine Learning, Natural Language Processing, High Dimensional Data, Signal Processing
Jian Luan
Toshiba, Microsoft, Xiaomi
LLM, VLM, TTS, Singing Synthesis
Wei Liu
MiLM Plus, Xiaomi Inc.