🤖 AI Summary
This work addresses the computational inefficiency often encountered when combining reinforcement learning with model predictive control (MPC) in large-scale settings. The authors propose an efficient policy synthesis framework that learns in a soft value space, leveraging sampling-based planning both for online control and for value-target generation. An amortized warm-start mechanism reuses action sequences from prior planning iterations, and the terminal Q-function is aligned with the short-horizon MPC planner to implicitly extend the effective planning horizon. Integrating model predictive path integral (MPPI) control, fitted value iteration, ensemble dynamics models, and scenario-based planning, the method demonstrates superior sample efficiency, robustness, and scalability across both classical and complex control tasks.
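The core mechanism described above — sampling-based MPC whose rollout cost is bootstrapped by a learned terminal Q-function — can be illustrated with a minimal MPPI update. This is a sketch under stated assumptions, not the paper's implementation: the function names (`dynamics`, `reward`, `terminal_q`), the single dynamics model (the paper uses an ensemble with scenario-based planning), and all hyperparameter defaults are illustrative choices.

```python
import numpy as np

def mppi_plan(dynamics, reward, terminal_q, state, nominal, *,
              num_samples=64, noise_std=0.3, temperature=1.0, rng=None):
    """One MPPI update with a learned terminal value.

    Samples noisy perturbations of a nominal open-loop action sequence,
    rolls them out through an (assumed) dynamics model, scores each rollout
    by accumulated reward plus a terminal Q-value at the horizon, and
    returns the softmax-weighted average plan. The terminal Q-term is what
    implicitly extends the effective planning horizon beyond `horizon`.
    """
    rng = np.random.default_rng() if rng is None else rng
    horizon, act_dim = nominal.shape
    noise = rng.normal(0.0, noise_std, (num_samples, horizon, act_dim))
    actions = nominal[None] + noise                    # (K, H, A)
    returns = np.zeros(num_samples)
    states = np.repeat(state[None], num_samples, axis=0)
    for t in range(horizon):
        returns += reward(states, actions[:, t])
        states = dynamics(states, actions[:, t])
    # Bootstrap with the learned terminal Q-function at the horizon.
    returns += terminal_q(states, actions[:, -1])
    # Numerically stable softmax weighting over sampled rollouts.
    weights = np.exp((returns - returns.max()) / temperature)
    weights /= weights.sum()
    return nominal + np.einsum("k,kha->ha", weights, noise)
```

With toy callables (e.g. linear dynamics `s + 0.1 * a` and a quadratic state cost), one call to `mppi_plan` returns an updated `(horizon, act_dim)` action sequence; in a control loop the first action is executed and the rest is reused as the next warm start.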
📝 Abstract
Reinforcement learning (RL) and model predictive control (MPC) offer complementary strengths, yet combining them at scale remains computationally challenging. We propose soft MPCritic, an RL-MPC framework that learns in (soft) value space while using sample-based planning for both online control and value-target generation. soft MPCritic instantiates MPC through model predictive path integral control (MPPI) and trains a terminal Q-function with fitted value iteration, aligning the learned value function with the planner and implicitly extending the effective planning horizon. We introduce an amortized warm-start strategy that recycles open-loop action sequences planned during online control when computing batched MPPI-based value targets. This makes soft MPCritic computationally practical while preserving solution quality. soft MPCritic plans in a scenario-based fashion with an ensemble of dynamics models trained for next-step prediction accuracy. Together, these ingredients enable soft MPCritic to learn effectively through robust, short-horizon planning on classic and complex control tasks. These results establish soft MPCritic as a practical and scalable blueprint for synthesizing MPC policies in settings where policy extraction and direct, long-horizon planning may fail.
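The amortized warm-start idea in the abstract — recycling an open-loop plan rather than re-planning from scratch — typically amounts to shifting the previous plan by one executed step and padding the tail. The sketch below assumes this standard shift-and-pad scheme; the function name and padding choices are illustrative, not the paper's exact procedure, and the stored plan would additionally be saved alongside each transition so batched value-target computation can reuse it.

```python
import numpy as np

def warm_start(prev_plan, pad="repeat"):
    """Amortized warm-start for sampling-based MPC.

    Shifts the previous open-loop action sequence left by one step (its
    first action has already been executed) and pads the final step, so
    the next MPPI call, or a batched value-target computation, starts
    near a good solution instead of from an uninformed nominal plan.
    """
    shifted = np.roll(prev_plan, -1, axis=0)       # drop the executed step
    shifted[-1] = prev_plan[-1] if pad == "repeat" else 0.0
    return shifted
```

For example, warm-starting a 3-step plan `[[a0], [a1], [a2]]` yields `[[a1], [a2], [a2]]` with `pad="repeat"`, keeping the tail action as a reasonable guess for the newly exposed step.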