A KL-regularization framework for learning to plan with adaptive priors

📅 2025-10-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low exploration efficiency and poor sample utilization of model-based reinforcement learning (MBRL) in high-dimensional continuous control, this paper proposes PO-MPC, a unified framework that treats the action distribution output by a model predictive control (MPC) planner as an adaptive prior. Policy optimization is guided by KL-divergence regularization, enabling dynamic trade-offs between return maximization and alignment with the planner's behavioral distribution. The method integrates Model Predictive Path Integral (MPPI) planning, deterministic policy-gradient updates on a learned value function, and entropy regularization. Crucially, it is the first work to systematically model the planning-induced action distribution as a learnable prior, thereby unifying and generalizing multiple MBRL paradigms. Evaluated on standard continuous-control benchmarks, PO-MPC significantly improves MPPI-based RL performance and establishes new state-of-the-art results for this class of methods.
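The MPPI planner referenced in the summary induces an action distribution by weighting sampled action sequences with a softmax over their returns; PO-MPC-style methods then treat the first-step statistics of that distribution as a prior. A minimal numpy sketch of this standard MPPI weighting step follows; the function name, the diagonal-Gaussian summary statistics, and the `temperature` parameter are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def mppi_action_distribution(trajectory_returns, sampled_first_actions, temperature=1.0):
    """Weight sampled action sequences by exponentiated return (standard MPPI).

    Returns the weighted mean and std of the first actions, i.e. a diagonal-
    Gaussian summary of the planner-induced distribution that a PO-MPC-style
    method could use as its prior. All names here are illustrative.
    """
    returns = np.asarray(trajectory_returns, dtype=float)
    actions = np.asarray(sampled_first_actions, dtype=float)
    # Softmax over returns; subtract the max for numerical stability.
    weights = np.exp((returns - returns.max()) / temperature)
    weights /= weights.sum()
    mean = np.sum(weights[:, None] * actions, axis=0)
    var = np.sum(weights[:, None] * (actions - mean) ** 2, axis=0)
    return mean, np.sqrt(var)
```

With a low temperature the weighting concentrates on the highest-return sample, recovering near-greedy planning; a higher temperature spreads probability mass and keeps exploration alive.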

📝 Abstract
Effective exploration remains a central challenge in model-based reinforcement learning (MBRL), particularly in high-dimensional continuous control tasks where sample efficiency is crucial. A prominent line of recent work leverages learned policies as proposal distributions for Model-Predictive Path Integral (MPPI) planning. Initial approaches update the sampling policy independently of the planner distribution, typically maximizing a learned value function with deterministic policy gradient and entropy regularization. However, because the states encountered during training depend on the MPPI planner, aligning the sampling policy with the planner improves the accuracy of value estimation and long-term performance. To this end, recent methods update the sampling policy by minimizing KL divergence to the planner distribution or by introducing planner-guided regularization into the policy update. In this work, we unify these MPPI-based reinforcement learning methods under a single framework by introducing Policy Optimization-Model Predictive Control (PO-MPC), a family of KL-regularized MBRL methods that integrate the planner's action distribution as a prior in policy optimization. By aligning the learned policy with the planner's behavior, PO-MPC allows more flexibility in the policy updates to trade off return maximization and KL divergence minimization. We clarify how prior approaches emerge as special cases of this family, and we explore previously unstudied variations. Our experiments show that these extended configurations yield significant performance improvements, advancing the state of the art in MPPI-based RL.
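The trade-off the abstract describes (return maximization vs. KL minimization toward the planner prior) can be sketched as a single regularized policy objective. Below is a minimal numpy illustration assuming diagonal-Gaussian policy and planner distributions; the function names, the closed-form Gaussian KL, and the `kl_weight` coefficient are assumptions for illustration, not the paper's exact update rule.

```python
import numpy as np

def kl_diag_gaussians(mu_p, sig_p, mu_q, sig_q):
    """KL(p || q) between diagonal Gaussians, summed over action dimensions."""
    return np.sum(
        np.log(sig_q / sig_p)
        + (sig_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sig_q ** 2)
        - 0.5
    )

def policy_loss(q_value, policy_mu, policy_sig, prior_mu, prior_sig, kl_weight=0.1):
    """Return-maximization term regularized toward the planner's prior.

    kl_weight = 0 recovers a pure value-maximizing update; a large kl_weight
    recovers pure distillation of the planner distribution, so prior methods
    sit at the extremes of this one objective.
    """
    kl = kl_diag_gaussians(policy_mu, policy_sig, prior_mu, prior_sig)
    return -q_value + kl_weight * kl
```

Sweeping `kl_weight` interpolates between the independent-policy-update and planner-distillation special cases the abstract mentions.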
Problem

Research questions and friction points this paper is trying to address.

Improving exploration in model-based reinforcement learning
Aligning sampling policies with planner distributions
Integrating KL regularization for flexible policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

KL-regularized MBRL framework integrates planner distribution as prior
Policy Optimization-MPC aligns learned policy with planner behavior
Flexible policy updates balance return and divergence objectives
Álvaro Serra-Gomez
Leiden University
Daniel Jarne Ornia
University of Oxford
Dhruva Tirumala
DeepMind
Thomas Moerland
Leiden University