🤖 AI Summary
This work addresses an inefficiency in vision-language-action (VLA) reinforcement learning: conventional methods compute gradients uniformly across all trajectory segments, including those the policy has already mastered, which wastes training compute. To overcome this, the authors propose Probabilistic Chunk Masking (PCM), which concentrates gradient updates on the semantic phases of a trajectory where successful and failed rollouts diverge, without requiring a reward model or critic. Built on the GRPO framework, PCM uses the action variance between successful and failed rollouts as a proxy for per-phase gradient variance, and samples a small subset of informative trajectory chunks to backpropagate through. Evaluated on three LIBERO benchmarks, PCM matches the success rate of standard GRPO while delivering a 2.38× end-to-end speedup, 4.8× faster gradient updates, and a 60% reduction in peak activation memory.
📝 Abstract
Reinforcement learning (RL) allows vision-language-action (VLA) policies to generalize beyond their training distribution by optimizing directly for task success, but post-training is computationally expensive. A natural response has been to speed up rollout collection through faster simulators and world models. In GRPO-based VLA RL, we find that the dominant cost lies elsewhere: gradient computation accounts for approximately 78% of wall-clock time per step in our runs, while rollout collection accounts for only 21%. Gradient cost dominates because much of this computation is spent on phases that contribute little to learning. GRPO's learning signal is driven by advantage variance: only phases where successful and failed rollouts diverge produce learning signal. However, GRPO assigns the same advantage to every chunk in a rollout. As a result, actor-update compute is spent uniformly across the trajectory, including phases the policy already handles after pre-training and supervised fine-tuning. This paper presents Probabilistic Chunk Masking (PCM), a drop-in modification to GRPO that allocates gradient computation to a small, probabilistically selected subset of chunks per trajectory. PCM scores semantic phases using success-failure action variance, a rollout-derived proxy for per-phase gradient variance, and samples a fixed chunk budget with online-updated phase-level keep probabilities. We formalize per-phase gradient variance as the quantity that determines where gradient computation is useful and show that success-failure action variance provides a measurable proxy for it. PCM requires no reward model or learned critic. On three LIBERO benchmarks, PCM matches the final success rate of standard GRPO while achieving a 2.38× wall-clock speedup, 4.8× faster gradient updates, and 60% lower peak activation memory, all while backpropagating through fewer than 20% of trajectory chunks.
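To make the sampling idea concrete, the following is a minimal NumPy sketch of phase scoring and chunk selection, not the authors' implementation. The abstract does not give the exact formulas, so several pieces are assumptions made only for illustration: the score is taken to be the squared gap between mean successful and mean failed actions, keep probabilities are updated with an exponential moving average, and the function names (`phase_scores`, `update_keep_probs`, `sample_chunk_mask`) and the `momentum` and `floor` parameters are hypothetical.

```python
import numpy as np


def phase_scores(actions, success, phase_ids, n_phases):
    """Score each phase by how much successful and failed rollouts differ.

    actions:   (N, T, D) array, N rollouts x T chunks x D action dims.
    success:   (N,) boolean array, True for successful rollouts.
    phase_ids: (T,) integer array, phase label of each chunk.
    Assumption: the score is the squared gap between group mean actions.
    """
    scores = np.zeros(n_phases)
    if success.all() or not success.any():
        return scores  # no success/failure contrast in this group
    gap = actions[success].mean(axis=0) - actions[~success].mean(axis=0)
    per_chunk = (gap ** 2).sum(axis=-1)  # (T,)
    for p in range(n_phases):
        in_phase = phase_ids == p
        if in_phase.any():
            scores[p] = per_chunk[in_phase].mean()
    return scores


def update_keep_probs(keep_probs, scores, momentum=0.9, floor=0.05):
    """Online update of phase-level keep probabilities (assumed EMA rule)."""
    total = scores.sum()
    if total > 0:
        target = scores / total
    else:
        target = np.full_like(keep_probs, 1.0 / len(keep_probs))
    new = momentum * keep_probs + (1.0 - momentum) * target
    new = np.maximum(new, floor)  # keep every phase occasionally sampled
    return new / new.sum()


def sample_chunk_mask(keep_probs, phase_ids, budget, rng):
    """Sample a fixed budget of chunks; returns a (T,) boolean mask."""
    counts = np.bincount(phase_ids, minlength=len(keep_probs))
    # Spread each phase's probability evenly over its chunks.
    weights = keep_probs[phase_ids] / counts[phase_ids]
    weights = weights / weights.sum()
    chosen = rng.choice(len(phase_ids), size=budget, replace=False, p=weights)
    mask = np.zeros(len(phase_ids), dtype=bool)
    mask[chosen] = True
    return mask
```

In a training loop, the returned mask would decide which chunks take part in the backward pass, while the remaining chunks are skipped. How the actual method weights the kept chunks in the loss is not stated in the abstract, so that step is left out here.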