PriPG-RL: Privileged Planner-Guided Reinforcement Learning for Partially Observable Systems with Anytime-Feasible MPC

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of low sample efficiency and suboptimal performance in partially observable Markov decision processes (POMDPs), where agents operate with incomplete state information. To overcome these limitations, the paper introduces Planner-to-Policy Soft Actor-Critic (P2P-SAC), a novel algorithm that leverages a privileged model predictive control (MPC) planner—feasible at any time step—as a teacher during training. This planner exploits full state access and an approximate dynamics model to guide, via knowledge distillation, a student policy that relies solely on partial observations. Theoretical analysis and empirical results demonstrate that P2P-SAC substantially improves both sample efficiency and policy performance. The approach exhibits high robustness and practical deployability, as validated on complex obstacle navigation tasks in NVIDIA Isaac Lab simulations and on a Unitree Go2 quadruped robot.
📝 Abstract
This paper addresses the problem of training a reinforcement learning (RL) policy under partial observability by exploiting a privileged, anytime-feasible planner agent available exclusively during training. We formalize this as a Partially Observable Markov Decision Process (POMDP) in which a planner agent with access to an approximate dynamical model and privileged state information guides a learning agent that observes only a lossy projection of the true state. To realize this framework, we introduce an anytime-feasible Model Predictive Control (MPC) algorithm that serves as the planner agent. For the learning agent, we propose Planner-to-Policy Soft Actor-Critic (P2P-SAC), a method that distills the planner agent's privileged knowledge to mitigate partial observability and thereby improve both sample efficiency and final policy performance. We support this framework with rigorous theoretical analysis. Finally, we validate our approach in simulation using NVIDIA Isaac Lab and successfully deploy it on a real-world Unitree Go2 quadruped navigating complex, obstacle-rich environments.
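The distillation idea described above can be sketched as a standard actor objective augmented with a penalty pulling the student's action distribution toward the privileged planner's. The sketch below is a minimal NumPy illustration under an assumed diagonal-Gaussian policy parameterization; the names (`guided_actor_loss`, `distill_weight`) and the closed-form KL term are ours for illustration, not the paper's exact P2P-SAC formulation.

```python
import numpy as np

def kl_diag_gaussian(mu_s, std_s, mu_t, std_t):
    """KL( N(mu_s, diag(std_s^2)) || N(mu_t, diag(std_t^2)) ), summed over action dims."""
    mu_s, std_s = np.asarray(mu_s, float), np.asarray(std_s, float)
    mu_t, std_t = np.asarray(mu_t, float), np.asarray(std_t, float)
    return float(np.sum(
        np.log(std_t / std_s)
        + (std_s**2 + (mu_s - mu_t) ** 2) / (2.0 * std_t**2)
        - 0.5
    ))

def guided_actor_loss(sac_actor_loss, student_pi, planner_pi, distill_weight=0.5):
    """Hypothetical student objective: SAC actor loss plus a planner-distillation term.

    student_pi / planner_pi are (mean, std) pairs of diagonal-Gaussian action
    distributions; the planner term vanishes once the student matches the planner.
    """
    mu_s, std_s = student_pi
    mu_t, std_t = planner_pi
    return sac_actor_loss + distill_weight * kl_diag_gaussian(mu_s, std_s, mu_t, std_t)

# Identical student and planner distributions add no distillation penalty:
student = ([0.2, -0.1], [0.3, 0.3])
print(guided_actor_loss(1.0, student, student))  # → 1.0
```

Because the teacher is only consulted during training, the deployed student needs nothing beyond its partial observations at test time; the weight on the distillation term would typically be annealed as the student improves.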
Problem

Research questions and friction points this paper is trying to address.

Partial Observability
Reinforcement Learning
POMDP
Policy Learning
State Estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Privileged Learning
Anytime-Feasible MPC
Partial Observability
Policy Distillation
Reinforcement Learning