🤖 AI Summary
To address the slow convergence of MC-PILCO in policy optimization, this paper introduces the nonlinear trajectory optimization method iLQR into its framework for the first time, proposing an exploration-enhanced co-optimization mechanism: iLQR generates high-information initial trajectories that initialize the policy, which is then jointly optimized with a Gaussian process dynamics model. This approach addresses the twin bottlenecks of sample efficiency and optimization speed in conventional model-based reinforcement learning, significantly accelerating convergence while maintaining task success rates. On the cart-pole benchmark, the method reduces execution time by 45.9% when both approaches solve the task in four trials, achieves a 100% success rate, and remains faster overall even in cases where MC-PILCO converges in fewer iterations. These results demonstrate a joint improvement in computational efficiency and robustness.
📝 Abstract
This paper addresses the slow policy optimization convergence of Monte Carlo Probabilistic Inference for Learning Control (MC-PILCO), a state-of-the-art model-based reinforcement learning (MBRL) algorithm, by integrating it with the iterative Linear Quadratic Regulator (iLQR), a fast trajectory optimization method suitable for nonlinear systems. The proposed method, Exploration-Boosted MC-PILCO (EB-MC-PILCO), leverages iLQR to generate informative, exploratory trajectories and initialize the policy, significantly reducing the number of required optimization steps. Experiments on the cart-pole task demonstrate that EB-MC-PILCO accelerates convergence compared to standard MC-PILCO, achieving up to a $\bm{45.9\%}$ reduction in execution time when both methods solve the task in four trials. EB-MC-PILCO also maintains a $\bm{100\%}$ success rate across trials while solving the task faster, even in cases where MC-PILCO converges in fewer iterations.
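To make the trajectory-optimization side concrete, here is a minimal sketch of a generic iLQR loop: linearize the dynamics and quadratize the cost around the current rollout, run a backward Riccati pass to get feedback gains, then roll forward with a line search. This is an illustrative stand-in, not the paper's implementation — the pendulum dynamics, cost weights, and horizon below are all assumptions chosen for brevity (the paper uses a cart-pole and couples iLQR with a GP dynamics model).

```python
import numpy as np

# Assumed toy system: discretized pendulum swing-up (stand-in for cart-pole).
dt = 0.05
x_goal = np.array([np.pi, 0.0])          # upright target: angle pi, zero velocity
Q = np.diag([1.0, 0.1])                  # state cost weights (assumed)
R = np.array([[0.01]])                   # control cost weight (assumed)

def f(x, u):
    """One Euler step of pendulum dynamics."""
    th, om = x
    return np.array([th + dt * om, om + dt * (np.sin(th) + u[0])])

def f_jac(x, u):
    """Jacobians A = df/dx, B = df/du at (x, u)."""
    th, _ = x
    A = np.array([[1.0, dt], [dt * np.cos(th), 1.0]])
    B = np.array([[0.0], [dt]])
    return A, B

def total_cost(xs, us):
    c = sum((x - x_goal) @ Q @ (x - x_goal) + u @ R @ u for x, u in zip(xs, us))
    return c + (xs[-1] - x_goal) @ Q @ (xs[-1] - x_goal)

def rollout(x0, us):
    xs = [x0]
    for u in us:
        xs.append(f(xs[-1], u))
    return xs

def ilqr(x0, us, iters=30):
    T = len(us)
    for _ in range(iters):
        xs = rollout(x0, us)
        # Backward pass: propagate a quadratic value-function approximation.
        Vx, Vxx = 2 * Q @ (xs[-1] - x_goal), 2 * Q
        ks, Ks = [None] * T, [None] * T
        for t in reversed(range(T)):
            A, B = f_jac(xs[t], us[t])
            Qx = 2 * Q @ (xs[t] - x_goal) + A.T @ Vx
            Qu = 2 * R @ us[t] + B.T @ Vx
            Qxx = 2 * Q + A.T @ Vxx @ A
            Quu = 2 * R + B.T @ Vxx @ B
            Qux = B.T @ Vxx @ A
            ks[t] = -np.linalg.solve(Quu, Qu)        # feedforward term
            Ks[t] = -np.linalg.solve(Quu, Qux)       # feedback gain
            Vx = Qx + Ks[t].T @ Quu @ ks[t] + Ks[t].T @ Qu + Qux.T @ ks[t]
            Vxx = Qxx + Ks[t].T @ Quu @ Ks[t] + Ks[t].T @ Qux + Qux.T @ Ks[t]
        # Forward pass with backtracking line search on the step size alpha.
        J0, alpha = total_cost(xs, us), 1.0
        while alpha > 1e-4:
            xs_new, us_new = [x0], []
            for t in range(T):
                u = us[t] + alpha * ks[t] + Ks[t] @ (xs_new[-1] - xs[t])
                us_new.append(u)
                xs_new.append(f(xs_new[-1], u))
            if total_cost(xs_new, us_new) < J0:      # accept only improvements
                us = us_new
                break
            alpha *= 0.5
        else:
            break                                    # no improving step found
    return us

x0 = np.array([0.0, 0.0])                            # hanging down, at rest
us_opt = ilqr(x0, [np.zeros(1) for _ in range(60)])
```

In EB-MC-PILCO's setting, trajectories like the one optimized above are what supply informative data and a policy initialization before the Monte Carlo gradient-based policy optimization takes over.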