🤖 AI Summary
To address the weak, noisy, and high-variance gradient signals inherent in on-policy reinforcement learning for robot policy optimization, this paper proposes a two-stage policy refinement framework: first pretraining an initial policy via Proximal Policy Optimization (PPO), then applying a gradient-free refinement using Triangular-Distribution Evolution Strategies (TD-ES). TD-ES employs bounded triangular-distribution perturbations, symmetric (antithetic) sampling, and a centered-rank finite-difference estimator to keep exploration bounded while substantially reducing gradient-estimation variance. The method is fully gradient-free, highly parallelizable, and computationally lightweight. Evaluated across multiple robotic manipulation tasks, it improves policy success rates by 26.5% on average over the PPO-only baseline, alongside a marked reduction in training variance. This work offers a simple, reliable, and sample-efficient paradigm for policy optimization in robotics.
📝 Abstract
Improving competent robot policies with on-policy RL is often hampered by noisy, low-signal gradients. We revisit Evolution Strategies (ES) as a policy-gradient proxy and localize exploration with bounded, antithetic triangular perturbations, making the approach well suited to policy refinement. We propose Triangular-Distribution ES (TD-ES), which pairs bounded triangular noise with a centered-rank finite-difference estimator to deliver stable, parallelizable, gradient-free updates. In a two-stage pipeline (PPO pretraining followed by TD-ES refinement), this preserves early sample efficiency while enabling robust late-stage gains. Across a suite of robotic manipulation tasks, TD-ES raises success rates by 26.5% relative to PPO and greatly reduces variance, offering a simple, compute-light path to reliable refinement.
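The core TD-ES ingredients described above (bounded triangular perturbations, antithetic sampling, and a centered-rank finite-difference estimate) can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the function names, step sizes, and the exact normalization of the gradient estimate are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def centered_ranks(x):
    """Map fitness values to centered ranks in [-0.5, 0.5] (rank transform
    makes the update invariant to the scale of returns)."""
    ranks = np.empty(len(x), dtype=np.float64)
    ranks[np.argsort(x)] = np.arange(len(x))
    return ranks / (len(x) - 1) - 0.5

def td_es_step(theta, fitness, n_pairs=16, scale=0.05, lr=0.01):
    """One illustrative TD-ES update (normalization constants are assumptions).

    Samples antithetic pairs (theta + eps, theta - eps) with eps drawn from a
    bounded, zero-mode triangular distribution on [-scale, scale], then forms a
    centered-rank finite-difference gradient estimate from the paired fitnesses.
    """
    d = theta.size
    # Bounded triangular noise: support [-scale, scale], mode 0 (symmetric)
    eps = rng.triangular(-scale, 0.0, scale, size=(n_pairs, d))
    candidates = np.concatenate([theta + eps, theta - eps])  # antithetic pairs
    f = np.array([fitness(c) for c in candidates])           # parallelizable
    r = centered_ranks(f)
    # Finite difference over each pair: (rank(+) - rank(-)) weights the shared
    # perturbation direction, so noisy return magnitudes never enter directly.
    grad = ((r[:n_pairs] - r[n_pairs:])[:, None] * eps).sum(axis=0)
    grad /= 2 * n_pairs * scale
    return theta + lr * grad
```

In a two-stage pipeline, `theta` would be the flattened parameters of a PPO-pretrained policy and `fitness` a rollout returning episodic reward; here any black-box objective works, since the update never needs gradients of `fitness`.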