🤖 AI Summary
In legged robot reinforcement learning (RL), domain randomization improves robustness but compromises locomotion efficiency and tracking accuracy under nominal conditions, yielding overly conservative policies. Method: We propose a two-stage decoupled framework: first, an optimal policy derived from a nominal dynamics model generates “imagined transitions”; second, these transitions are explicitly injected as demonstration signals into the RL training pipeline. Contribution/Results: This is the first work to formalize imagined transitions as demonstrative inputs for quadrupedal locomotion learning, breaking the strong coupling between robustness and optimality inherent in conventional domain randomization. Our approach integrates RL, optimal control, and Sim2Real transfer techniques. Experiments demonstrate accelerated training convergence, significantly reduced in-distribution trajectory tracking error, enhanced out-of-distribution environmental adaptability, and—crucially—recovery of near-optimal motion performance while preserving high robustness.
📝 Abstract
Existing quadrupedal locomotion learning paradigms usually rely on extensive domain randomization to narrow the sim2real gap and enhance robustness. These methods train policies across a wide range of environment parameters and sensor noise so that they perform reliably under uncertainty. However, since optimal performance under ideal conditions often conflicts with the need to handle worst-case scenarios, there is a trade-off between optimality and robustness. This trade-off forces the learned policy to prioritize stability in diverse and challenging conditions over efficiency and accuracy in ideal ones, leading to overly conservative behaviors that sacrifice peak performance. In this paper, we propose a two-stage framework that mitigates this trade-off by integrating policy learning with imagined transitions. This framework enhances the conventional reinforcement learning (RL) approach by incorporating imagined transitions as demonstrative inputs. These imagined transitions are derived from an optimal policy and a dynamics model operating in an idealized setting. Our findings indicate that this approach significantly mitigates the negative impact that domain randomization has on existing RL algorithms. It leads to accelerated training, reduced in-distribution tracking errors, and enhanced out-of-distribution robustness.
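The two-stage idea can be sketched schematically. In the toy example below (all names, dynamics, and the linear policy form are hypothetical illustrations, not the paper's actual models), stage one rolls an optimal policy through an idealized nominal dynamics model to produce imagined (state, action) transitions, and stage two scores a learning policy against those transitions with a behavior-cloning-style demonstration loss that would be added to the usual RL objective:

```python
import numpy as np

# Hypothetical 1-D setting: nominal dynamics s' = s + a, and a
# nominal-optimal policy a* = -0.5 * s. The learning policy is
# parameterized as a = k * s for a single scalar gain k.
def nominal_dynamics(s, a):
    return s + a

def optimal_policy(s):
    return -0.5 * s

def rollout_imagined(s0, horizon):
    """Stage 1: roll the nominal-optimal policy through the idealized
    dynamics model to generate imagined (state, action) demonstrations."""
    demos, s = [], s0
    for _ in range(horizon):
        a = optimal_policy(s)
        demos.append((s, a))
        s = nominal_dynamics(s, a)
    return demos

def demo_loss(k, demos):
    """Stage 2 (schematic): a behavior-cloning term that pulls the
    randomization-trained policy toward the imagined demonstrations."""
    return float(np.mean([(k * s - a) ** 2 for s, a in demos]))

demos = rollout_imagined(s0=2.0, horizon=5)
# The demonstration loss vanishes when the learner matches the nominal optimum,
# and a combined objective would look like: total = rl_loss + beta * demo_loss.
print(demo_loss(-0.5, demos))  # 0.0
print(demo_loss(0.0, demos) > 0.0)  # True
```

The point of the demonstration term is that it anchors the policy to near-optimal nominal behavior even while the RL loss is computed under heavy domain randomization, which is how the framework decouples robustness from optimality.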