🤖 AI Summary
Model-based reinforcement learning (MBRL) suffers from policy bias due to inaccurate dynamics models—especially in data-scarce, high-uncertainty regions where modeling errors are large and multi-step predictions degrade significantly.
Method: We propose a unified framework integrating uncertainty-aware *k*-step forward planning with active exploration. For the first time, we jointly incorporate model uncertainty and value function error into the *k*-step planning objective. Additionally, we design an uncertainty-driven active sampling strategy that selectively collects transitions from high-uncertainty states during policy optimization to improve model fidelity.
Contribution/Results: Our approach departs from conventional passive uncertainty estimation by enabling co-optimization of planning and model learning. Evaluated on robotic manipulation and Atari benchmarks, it achieves state-of-the-art performance with fewer environment interactions, while substantially improving multi-step prediction accuracy and policy robustness.
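The uncertainty-driven active sampling idea can be illustrated with a minimal disagreement-based sketch. This is an assumption-laden toy, not the paper's implementation: it supposes uncertainty is proxied by the variance among an ensemble of learned dynamics models, and that `state`, `candidate_actions`, and the model interfaces shown are placeholders.

```python
import numpy as np

def exploratory_action(state, candidate_actions, ensemble):
    """Pick the action whose predicted next state the ensemble disagrees
    on most -- a simple disagreement proxy for uncertainty-driven active
    sampling. All interfaces here are illustrative assumptions:
      ensemble -- list of dynamics models, each model(s, a) -> next state
    """
    disagreements = []
    for a in candidate_actions:
        # Stack each member's prediction and measure their spread.
        preds = np.stack([np.atleast_1d(m(state, a)) for m in ensemble])
        disagreements.append(np.mean(np.var(preds, axis=0)))
    # Highest disagreement = most informative transition to collect.
    return candidate_actions[int(np.argmax(disagreements))]
```

Executing this action and adding the resulting transition to the model's training set targets exactly the high-uncertainty regions where, per the summary, modeling errors are largest.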
📝 Abstract
Model-based reinforcement learning (MBRL) has demonstrated superior sample efficiency compared to model-free reinforcement learning (MFRL). However, inaccurate models can bias policy learning by generating misleading trajectories. Accurate models are hard to obtain when training data lack diversity, particularly in rarely visited (high-uncertainty) regions of the state space. Existing approaches quantify uncertainty passively, after samples are generated, rather than actively collecting uncertain samples that would broaden state coverage and improve model accuracy. Moreover, MBRL often struggles to make accurate multi-step predictions, which degrades overall performance. To address these limitations, we propose a novel framework for uncertainty-aware policy optimization with model-based exploratory planning. In the model-based planning phase, we introduce an uncertainty-aware *k*-step lookahead planning approach that guides action selection at each step by trading off model uncertainty against value function approximation error, effectively enhancing policy performance. In the policy optimization phase, we leverage an uncertainty-driven exploratory policy to actively collect diverse training samples, improving model accuracy and the overall performance of the RL agent. Our approach is flexible and applicable to tasks with varying state/action spaces and reward structures. Experiments on challenging robotic manipulation tasks and Atari games show that it surpasses state-of-the-art methods while requiring fewer environment interactions.
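The uncertainty-aware *k*-step lookahead described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: it assumes an ensemble of dynamics models whose disagreement proxies model uncertainty, a penalty weight `beta` standing in for the paper's trade-off between model uncertainty and value approximation error, and a placeholder policy that simply repeats the first action during the rollout. All names and interfaces are hypothetical.

```python
import numpy as np

def k_step_plan(state, candidate_actions, ensemble, reward_fn, value_fn,
                k=3, gamma=0.99, beta=1.0):
    """Score each candidate first action by a k-step ensemble rollout.

    Hypothetical interfaces (assumptions, not from the paper):
      ensemble  -- list of dynamics models, each model(s, a) -> next state
      reward_fn -- reward_fn(s, a) -> scalar reward
      value_fn  -- value_fn(s) -> terminal value estimate, whose
                   approximation error the uncertainty penalty trades
                   off against
    """
    scores = []
    for a0 in candidate_actions:
        returns, finals = [], []
        for model in ensemble:
            # Roll each ensemble member out independently for k steps.
            s, ret = state, 0.0
            for t in range(k):
                ret += (gamma ** t) * reward_fn(s, a0)
                s = model(s, a0)  # placeholder policy: repeat a0
            # Bootstrap with the value estimate at the rollout horizon.
            ret += (gamma ** k) * value_fn(s)
            returns.append(ret)
            finals.append(s)
        # Ensemble disagreement on the final state proxies model
        # uncertainty; beta weights it against the predicted return.
        uncertainty = np.mean(np.var(np.stack(finals), axis=0))
        scores.append(np.mean(returns) - beta * uncertainty)
    return candidate_actions[int(np.argmax(scores))]
```

The penalty term discourages actions whose outcomes the model is unsure about during exploitation, which is the opposite pressure from the exploratory sampling phase, where high-disagreement transitions are deliberately sought out to improve the model.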