🤖 AI Summary
This work addresses robust sim-to-real transfer for reinforcement learning policies that converge in simulation but fail to generalize reliably on physical robots. It proposes a post-convergence robust transfer paradigm that replaces heuristic, simulation-performance-based policy selection with a theoretically grounded optimization framework: policy selection is formulated as a convex quadratically constrained linear program that optimizes worst-case real-world performance, yielding provable robustness guarantees. The method combines convex optimization, worst-case performance modeling, and empirical policy evaluation, eliminating ad hoc "cherry-picking." On legged-robot locomotion control tasks, the approach significantly improves real-world deployment success rates and consistently outperforms conventional strategies that select the policy with the highest simulated reward.
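The precise program is not specified here, so the following is only an illustrative sketch of what a worst-case selection QCLP can look like; all symbols are assumptions, not the paper's notation. Let $w \in \mathbb{R}^n$ weight $n$ converged candidate checkpoints, let $r_j \in \mathbb{R}^n$ hold their estimated returns under perturbed-dynamics scenario $j$, and let $\Sigma \succeq 0$ model estimation uncertainty with risk budget $\rho$. Maximizing the worst case over scenarios in epigraph form gives

$$
\max_{w,\,t}\; t
\quad \text{s.t.} \quad
r_j^\top w \ge t \;\; \forall j,
\qquad
w^\top \Sigma w \le \rho^2,
\qquad
\mathbf{1}^\top w = 1,\; w \ge 0,
$$

a linear objective under linear constraints plus one convex quadratic constraint, i.e., a convex quadratically constrained linear program.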
📝 Abstract
Learning-based approaches, particularly reinforcement learning (RL), have become widely used for developing control policies for autonomous agents, such as locomotion policies for legged robots. RL training typically maximizes a predefined reward (or minimizes a corresponding cost/loss) by iteratively optimizing policies within a simulator. Starting from a randomly initialized policy, the empirical expected reward follows a trajectory with an overall increasing trend. Although some policies become temporarily stuck in local optima, a well-defined training process generally converges to a reward level with noisy oscillations. However, selecting a policy for real-world deployment is rarely an analytical decision (i.e., simply choosing the one with the highest reward); instead, it is often performed through trial and error. To improve sim-to-real transfer, most research focuses on the pre-convergence stage, employing techniques such as domain randomization, multi-fidelity training, adversarial training, and architectural innovations. These methods, however, do not eliminate the noisy reward oscillations that inevitably accompany convergence, so policy selection remains heuristic, amounting to cherry-picking. This paper addresses the post-convergence sim-to-real transfer problem by introducing a worst-case performance transference optimization approach, formulated as a convex quadratically constrained linear program. Extensive experiments demonstrate its effectiveness in transferring RL-based locomotion policies from simulation to real-world laboratory tests.
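As a concrete (hypothetical) instance of the formulation sketched above, the snippet below solves the same worst-case selection QCLP with cvxpy. The data `R`, `Sigma`, and `rho` are placeholders, not values from the paper: `R[j, i]` would be the estimated return of candidate checkpoint `i` under perturbed-dynamics scenario `j`, and the quadratic constraint bounds selection risk.

```python
# Illustrative sketch only: the paper's exact QCLP is not reproduced here.
# R, Sigma, and rho are placeholder data, not values from the paper.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n_ckpts, n_scenarios = 8, 5

# R[j, i]: estimated return of checkpoint i under dynamics scenario j.
R = rng.normal(10.0, 1.0, size=(n_scenarios, n_ckpts))
A = rng.normal(size=(n_ckpts, n_ckpts))
Sigma = A @ A.T                        # PSD covariance of return estimates
rho = 2.0                              # risk budget

w = cp.Variable(n_ckpts, nonneg=True)  # selection weights over checkpoints
t = cp.Variable()                      # epigraph variable: worst-case return

prob = cp.Problem(
    cp.Maximize(t),
    [
        cp.sum(w) == 1,                    # weights on the probability simplex
        R @ w >= t,                        # t lower-bounds every scenario's return
        cp.quad_form(w, Sigma) <= rho**2,  # convex quadratic risk constraint
    ],
)
prob.solve()

print(f"worst-case return: {t.value:.3f}")
print("checkpoint weights:", np.round(w.value, 3))
```

Note that conventional selection (argmax of simulated reward) corresponds to a single vertex of the simplex; when that vertex violates the quadratic risk bound, the program hedges across checkpoints instead, which is one way a worst-case criterion departs from cherry-picking.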