🤖 AI Summary
Deep reinforcement learning (DRL) policies for underactuated robotic control often suffer from misalignment with task objectives and poor robustness. Method: This paper proposes a zeroth-order fine-tuning approach based on Separable Natural Evolution Strategies (SNES), applied directly to a pre-trained Soft Actor-Critic (SAC) policy. The SAC agent is first trained with a surrogate reward function that approximates the true scoring metric; the SNES step then optimizes the original task score directly, avoiding the bias of gradient estimation and requiring no policy reparameterization or architectural modification. Contribution/Results: The method significantly improves control accuracy and robustness in complex, dynamic environments. Evaluated on the IROS 2024 RealAIGym competition benchmark, it substantially outperforms baseline methods, achieving state-of-the-art scores and demonstrating both effectiveness and strong generalization capability.
📝 Abstract
Deep Reinforcement Learning (RL) has emerged as a powerful method for addressing complex control problems, particularly those involving underactuated robotic systems. However, in some cases, policies may require refinement to achieve optimal performance and robustness aligned with specific task objectives. In this paper, we propose an approach for fine-tuning Deep RL policies using Evolution Strategies (ES) to enhance control performance for underactuated robots. Our method involves initially training an RL agent with Soft Actor-Critic (SAC) using a surrogate reward function designed to approximate complex task-specific scoring metrics. We subsequently refine this learned policy through a zeroth-order optimization step employing the Separable Natural Evolution Strategy (SNES), directly targeting the original score. Experimental evaluations conducted in the context of the 2nd AI Olympics with RealAIGym at IROS 2024 demonstrate that our evolutionary fine-tuning significantly improves agent performance while maintaining high robustness. The resulting controllers outperform established baselines, achieving competitive scores on the competition tasks.
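The abstract does not include the update rules, but the fine-tuning step it describes can be sketched with the standard SNES recipe: sample Gaussian perturbations of the flattened pre-trained policy parameters, score each candidate with the true (non-differentiable) task metric, and apply rank-based natural-gradient updates to the search mean and per-dimension step sizes. Everything below (function names, learning rates, the small initial `sigma` that keeps the search near the SAC solution) is an illustrative assumption, not the paper's actual implementation.

```python
import numpy as np

def snes_finetune(theta0, score_fn, iters=200, popsize=16, lr_mu=1.0,
                  lr_sigma=None, seed=0):
    """Separable NES sketch: fine-tune theta0 to maximize score_fn.

    theta0   -- flattened pre-trained policy parameters (hypothetical stand-in)
    score_fn -- maps a parameter vector to a scalar episode score
    """
    rng = np.random.default_rng(seed)
    d = theta0.size
    mu = theta0.copy()
    sigma = 0.05 * np.ones(d)  # small init: stay close to the SAC solution
    if lr_sigma is None:
        # common SNES default step size for the per-dimension sigmas
        lr_sigma = (3 + np.log(d)) / (5 * np.sqrt(d))
    # rank-based utilities, shared across iterations (best sample gets most weight)
    ranks = np.arange(1, popsize + 1)
    util = np.maximum(0.0, np.log(popsize / 2 + 1) - np.log(ranks))
    util = util / util.sum() - 1.0 / popsize
    for _ in range(iters):
        z = rng.standard_normal((popsize, d))   # search directions
        candidates = mu + sigma * z             # perturbed policies
        scores = np.array([score_fn(c) for c in candidates])
        zs = z[np.argsort(-scores)]             # sort directions best-first
        mu = mu + lr_mu * sigma * (util @ zs)   # natural-gradient mean update
        sigma = sigma * np.exp(0.5 * lr_sigma * (util @ (zs**2 - 1)))
    return mu
```

In practice `score_fn` would roll out the perturbed policy in the environment and return the competition score; here a cheap synthetic objective suffices to exercise the loop:

```python
theta0 = np.ones(5)                              # pretend pre-trained weights
target = np.array([0.3, -0.2, 0.5, 0.0, 1.2])    # hypothetical optimum
score = lambda th: -np.sum((th - target) ** 2)
theta = snes_finetune(theta0, score, iters=300, popsize=20)
```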