M3PO: Massively Multi-Task Model-Based Policy Optimization

📅 2025-06-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Addressing the dual challenges of low sample efficiency and poor cross-task generalization in multi-task reinforcement learning, this paper proposes a scalable model-based policy optimization framework. Methodologically, it introduces: (1) an implicit world model that learns task-relevant dynamic representations without explicit observation reconstruction; (2) a bias-driven hybrid exploration mechanism integrating model-based planning with uncertainty-aware reward shaping to mitigate the trade-off between model bias and value estimation variance; and (3) a trust-region optimizer coupled with task-outcome prediction modeling to enhance policy stability and generalization. Evaluated on multiple standard multi-task benchmarks, the approach achieves state-of-the-art cross-task generalization performance while improving sample efficiency by an average factor of 2.3× compared to prior methods.

Technology Category

Application Category

📝 Abstract
We introduce Massively Multi-Task Model-Based Policy Optimization (M3PO), a scalable model-based reinforcement learning (MBRL) framework designed to address sample inefficiency in single-task settings and poor generalization in multi-task domains. Existing model-based approaches like DreamerV3 rely on pixel-level generative models that neglect control-centric representations, while model-free methods such as PPO suffer from high sample complexity and weak exploration. M3PO integrates an implicit world model, trained to predict task outcomes without observation reconstruction, with a hybrid exploration strategy that combines model-based planning and model-free uncertainty-driven bonuses. This eliminates the bias-variance trade-off in prior methods by using discrepancies between model-based and model-free value estimates to guide exploration, while maintaining stable policy updates through a trust-region optimizer. M3PO provides an efficient and robust alternative to existing model-based policy optimization approaches and achieves state-of-the-art performance across multiple benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Addresses sample inefficiency in single-task reinforcement learning
Improves generalization in multi-task reinforcement learning domains
Resolves bias-variance trade-off in model-based policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicit world model predicts task outcomes
Hybrid exploration combines planning and uncertainty
Trust-region optimizer ensures stable policy updates
🔎 Similar Papers
No similar papers found.
A
Aditya Narendra
Centre for Cognitive Modelling, Moscow Institute of Physics and Technology, 141701, Russia
Dmitry Makarov
Dmitry Makarov
Special Astrophysical Observatory of the Russian Academy of Sciences
AstrophysicsDark MatterDistance Scale
A
Aleksandr Panov
Centre for Cognitive Modelling, Moscow Institute of Physics and Technology, 141701, Russia; Federal Research Center "Computer Science and Control" RAS, 117312, Russia; AIRI, the Artificial Intelligence Research Institute, 117312, Russia