🤖 AI Summary
This work addresses robust sim-to-real transfer for reinforcement learning policies that converge in simulation but fail to generalize reliably on physical robots. It proposes a post-convergence robust transfer paradigm that replaces heuristic, simulation-performance-based policy selection with a theoretically grounded optimization framework: policy selection is formulated as a convex quadratically constrained linear program that optimizes worst-case real-world performance, yielding provable robustness guarantees. The method combines convex optimization, worst-case performance modeling, and empirical policy evaluation, eliminating ad hoc "cherry-picking." On legged-robot locomotion control tasks, the approach significantly improves real-world deployment success rates and consistently outperforms conventional strategies that select the policy with the highest simulated reward.
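The precise program is not specified here, so the following is only an illustrative sketch of what a worst-case selection QCLP can look like; all symbols are assumptions, not the paper's notation. Let $w \in \mathbb{R}^n$ weight $n$ converged candidate checkpoints, let $r_j \in \mathbb{R}^n$ hold their estimated returns under perturbed-dynamics scenario $j$, and let $\Sigma \succeq 0$ model estimation uncertainty with risk budget $\rho$. Maximizing the worst case over scenarios in epigraph form gives

$$
\max_{w,\,t}\; t
\quad \text{s.t.} \quad
r_j^\top w \ge t \;\; \forall j,
\qquad
w^\top \Sigma w \le \rho^2,
\qquad
\mathbf{1}^\top w = 1,\; w \ge 0,
$$

a linear objective under linear constraints plus one convex quadratic constraint, i.e., a convex quadratically constrained linear program.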
📝 Abstract
Learning-based approaches, particularly reinforcement learning (RL), have become widely used for developing control policies for autonomous agents, such as locomotion policies for legged robots. RL training typically maximizes a predefined reward (or minimizes a corresponding cost/loss) by iteratively optimizing policies within a simulator. Starting from a randomly initialized policy, the empirical expected reward follows a trajectory with an overall increasing trend. Although some policies become temporarily stuck in local optima, a well-defined training process generally converges to a reward level with noisy oscillations. However, selecting a policy for real-world deployment is rarely an analytical decision (i.e., simply choosing the one with the highest reward); instead, it is often performed through trial and error. To improve sim-to-real transfer, most research focuses on the pre-convergence stage, employing techniques such as domain randomization, multi-fidelity training, adversarial training, and architectural innovations. These methods, however, do not eliminate the noisy reward oscillations that inevitably accompany convergence, so policy selection remains heuristic, amounting to cherry-picking. This paper addresses the post-convergence sim-to-real transfer problem by introducing a worst-case performance transference optimization approach, formulated as a convex quadratically constrained linear program. Extensive experiments demonstrate its effectiveness in transferring RL-based locomotion policies from simulation to real-world laboratory tests.
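As a concrete (hypothetical) instance of the formulation sketched above, the snippet below solves the same worst-case selection QCLP with cvxpy. The data `R`, `Sigma`, and `rho` are placeholders, not values from the paper: `R[j, i]` would be the estimated return of candidate checkpoint `i` under perturbed-dynamics scenario `j`, and the quadratic constraint bounds selection risk.

```python
# Illustrative sketch only: the paper's exact QCLP is not reproduced here.
# R, Sigma, and rho are placeholder data, not values from the paper.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n_ckpts, n_scenarios = 8, 5

# R[j, i]: estimated return of checkpoint i under dynamics scenario j.
R = rng.normal(10.0, 1.0, size=(n_scenarios, n_ckpts))
A = rng.normal(size=(n_ckpts, n_ckpts))
Sigma = A @ A.T                        # PSD covariance of return estimates
rho = 2.0                              # risk budget

w = cp.Variable(n_ckpts, nonneg=True)  # selection weights over checkpoints
t = cp.Variable()                      # epigraph variable: worst-case return

prob = cp.Problem(
    cp.Maximize(t),
    [
        cp.sum(w) == 1,                    # weights on the probability simplex
        R @ w >= t,                        # t lower-bounds every scenario's return
        cp.quad_form(w, Sigma) <= rho**2,  # convex quadratic risk constraint
    ],
)
prob.solve()

print(f"worst-case return: {t.value:.3f}")
print("checkpoint weights:", np.round(w.value, 3))
```

Note that conventional selection (argmax of simulated reward) corresponds to a single vertex of the simplex; when that vertex violates the quadratic risk bound, the program hedges across checkpoints instead, which is one way a worst-case criterion departs from cherry-picking.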