Harnessing Bounded-Support Evolution Strategies for Policy Refinement

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the weak, noisy, high-variance gradient signals inherent in on-policy reinforcement learning for robot policy optimization, this paper proposes a two-stage policy refinement framework: first pretraining an initial policy via Proximal Policy Optimization (PPO), then applying a gradient-free refinement using the Triangular-Distribution Evolution Strategy (TD-ES). TD-ES combines bounded triangular-distribution perturbations, symmetric (antithetic) sampling, and a centered-rank finite-difference estimator, keeping exploration bounded while substantially reducing gradient-estimation variance. The method is fully gradient-free, highly parallelizable, and computationally lightweight. Evaluated across multiple robotic manipulation tasks, it achieves an average 26.5% improvement in policy success rate over the PPO-only baseline, alongside a marked reduction in training variance. This work offers a paradigm for high-reliability, sample-efficient policy optimization in robotics.

📝 Abstract
Improving competent robot policies with on-policy RL is often hampered by noisy, low-signal gradients. We revisit Evolution Strategies (ES) as a policy-gradient proxy and localize exploration with bounded, antithetic triangular perturbations, making the approach well suited to policy refinement. We propose Triangular-Distribution ES (TD-ES), which pairs bounded triangular noise with a centered-rank finite-difference estimator to deliver stable, parallelizable, gradient-free updates. In a two-stage pipeline (PPO pretraining followed by TD-ES refinement), this preserves early sample efficiency while enabling robust late-stage gains. Across a suite of robotic manipulation tasks, TD-ES raises success rates by 26.5% relative to PPO and greatly reduces variance, offering a simple, compute-light path to reliable refinement.
Problem

Research questions and friction points this paper is trying to address.

Improving robot policies with noisy reinforcement learning gradients
Developing stable evolution strategies for policy refinement
Enhancing robotic manipulation success rates with efficient methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses bounded triangular noise for exploration
Employs centered-rank finite-difference estimator
Combines PPO pretraining with ES refinement
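The three ingredients above can be sketched together in a single update step. This is an illustrative NumPy sketch, not the authors' implementation: the function names (`centered_ranks`, `td_es_step`), hyperparameter values, and the exact normalization of the finite-difference estimate are assumptions; only the overall recipe (bounded triangular noise, antithetic pairs, centered-rank weighting) follows the paper's description.

```python
import numpy as np

def centered_ranks(values):
    # Map raw returns to centered ranks in [-0.5, 0.5]; rank shaping makes
    # the update invariant to the scale of the reward signal.
    ranks = np.empty(len(values), dtype=np.float64)
    ranks[np.argsort(values)] = np.arange(len(values))
    return ranks / (len(values) - 1) - 0.5

def td_es_step(theta, evaluate, pop_size=16, scale=0.1, lr=0.1, rng=None):
    """One TD-ES-style update on parameter vector `theta`.

    `evaluate` maps a parameter vector to a scalar return. Hyperparameters
    here are illustrative, not taken from the paper.
    """
    rng = rng or np.random.default_rng()
    # Bounded exploration: symmetric triangular noise on [-scale, scale]
    # with mode 0, so no perturbation ever exceeds `scale` per coordinate.
    eps = rng.triangular(-scale, 0.0, scale, size=(pop_size, theta.size))
    # Antithetic (mirrored) sampling: evaluate +eps and -eps to cancel
    # odd-order terms and cut estimator variance.
    returns_pos = np.array([evaluate(theta + e) for e in eps])
    returns_neg = np.array([evaluate(theta - e) for e in eps])
    # Centered ranks over the joint population, then a finite-difference
    # gradient proxy from the mirrored pairs.
    weights = centered_ranks(np.concatenate([returns_pos, returns_neg]))
    w_pos, w_neg = weights[:pop_size], weights[pop_size:]
    grad = ((w_pos - w_neg)[:, None] * eps).sum(axis=0) / (pop_size * scale)
    return theta + lr * grad  # gradient *ascent* on the return
```

Because each `evaluate` call is independent, the inner loop parallelizes trivially across workers, which is what makes the refinement stage compute-light in wall-clock terms.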
Ethan Hirschowitz
The University of Sydney, Australia
Fabio Ramos
University of Sydney and NVIDIA
robotics · machine learning