Post-Convergence Sim-to-Real Policy Transfer: A Principled Alternative to Cherry-Picking

📅 2025-04-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of robust sim-to-real transfer for reinforcement learning policies that converge in simulation but fail to generalize reliably on physical robots. We propose a post-convergence robust transfer paradigm that replaces heuristic, simulation-performance-based policy selection with a theoretically grounded optimization framework. Specifically, we formulate policy selection as a convex quadratically constrained linear program, optimizing for worst-case real-world performance—thereby providing provable robustness guarantees. Our method integrates convex optimization, worst-case performance modeling, and empirical policy evaluation, eliminating ad hoc “cherry-picking.” Evaluated on legged robot locomotion control tasks, the approach significantly improves deployment success rates in the real world. Experiments demonstrate consistent superiority over conventional selection strategies that prioritize policies with the highest simulated reward.

📝 Abstract
Learning-based approaches, particularly reinforcement learning (RL), have become widely used for developing control policies for autonomous agents, such as locomotion policies for legged robots. RL training typically maximizes a predefined reward (or minimizes a corresponding cost/loss) by iteratively optimizing policies within a simulator. Starting from a randomly initialized policy, the empirical expected reward follows a trajectory with an overall increasing trend. While some policies become temporarily stuck in local optima, a well-defined training process generally converges to a reward level with noisy oscillations. However, selecting a policy for real-world deployment is rarely an analytical decision (i.e., simply choosing the one with the highest reward) and is instead often performed through trial and error. To improve sim-to-real transfer, most research focuses on the pre-convergence stage, employing techniques such as domain randomization, multi-fidelity training, adversarial training, and architectural innovations. However, these methods do not eliminate the inevitable convergence trajectory and noisy oscillations of rewards, leading to heuristic policy selection or cherry-picking. This paper addresses the post-convergence sim-to-real transfer problem by introducing a worst-case performance transference optimization approach, formulated as a convex quadratically constrained linear programming problem. Extensive experiments demonstrate its effectiveness in transferring RL-based locomotion policies from simulation to real-world laboratory tests.
Problem

Research questions and friction points this paper is trying to address.

Optimizing post-convergence sim-to-real policy transfer
Avoiding heuristic policy selection in RL training
Improving worst-case performance for real-world deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Worst-case performance transference optimization approach
Convex quadratically constrained linear programming formulation
Post-convergence sim-to-real transfer solution
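The abstract does not spell out the decision variables or constraints of the paper's optimization, so the following is only a minimal sketch of the general pattern it names: a linear objective (worst-case return across scenarios) with one quadratic constraint, i.e., a small quadratically constrained linear program. All specifics here are assumptions for illustration: converged checkpoints are blended by weights `w`, `R` holds hypothetical per-scenario return estimates, and the quadratic constraint is a hypothetical trust region around a prior weighting. It is solved with SciPy's generic SLSQP solver rather than a dedicated conic solver.

```python
import numpy as np
from scipy.optimize import minimize

def select_policy_mixture(R, w_prior, rho):
    """Maximin checkpoint weighting as a small QCLP-style problem:
        maximize t
        s.t.     R @ w >= t   (blended return in every scenario)
                 sum(w) = 1,  w >= 0
                 ||w - w_prior||^2 <= rho   (quadratic trust region)
    Variables x = [w_1, ..., w_n, t]: linear objective, one quadratic constraint.
    """
    n_scen, n_pol = R.shape
    x0 = np.concatenate([w_prior, [R.min()]])  # feasible start: prior weights

    cons = [
        # each scenario's blended return must be at least t
        {"type": "ineq", "fun": lambda x, s=s: R[s] @ x[:n_pol] - x[-1]}
        for s in range(n_scen)
    ]
    cons.append({"type": "eq", "fun": lambda x: np.sum(x[:n_pol]) - 1.0})
    cons.append({"type": "ineq",
                 "fun": lambda x: rho - np.sum((x[:n_pol] - w_prior) ** 2)})
    bounds = [(0.0, 1.0)] * n_pol + [(None, None)]

    res = minimize(lambda x: -x[-1], x0, method="SLSQP",
                   bounds=bounds, constraints=cons)
    w = res.x[:n_pol]
    return w, float(np.min(R @ w))

# Hypothetical data: rows = evaluation scenarios, columns = converged checkpoints.
R = np.array([[3.0, 1.0, 2.0],
              [1.0, 3.0, 2.0],
              [2.0, 2.0, 1.0]])
w, worst = select_policy_mixture(R, w_prior=np.full(3, 1/3), rho=0.5)
```

In this toy instance every single checkpoint has a worst-case return of 1, while the optimized blend raises the worst case, which is the contrast the paper draws against picking the highest-simulated-reward policy.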
Dylan Khor
Department of Computer Science, Iowa State University, IA, USA
Bowen Weng
Assistant Professor, Iowa State University
Testing, Robotics, Safe Autonomy, Control