🤖 AI Summary
Addressing the dual challenges of low sample efficiency and poor cross-task generalization in multi-task reinforcement learning, this paper proposes a scalable model-based policy optimization framework. Methodologically, it introduces: (1) an implicit world model that learns task-relevant dynamics representations without explicit observation reconstruction; (2) a hybrid exploration mechanism that integrates model-based planning with uncertainty-aware reward shaping, mitigating the trade-off between model bias and value-estimation variance; and (3) a trust-region optimizer coupled with task-outcome prediction to enhance policy stability and generalization. Evaluated on multiple standard multi-task benchmarks, the approach achieves state-of-the-art cross-task generalization while improving sample efficiency by an average factor of 2.3× over prior methods.
📝 Abstract
We introduce Massively Multi-Task Model-Based Policy Optimization (M3PO), a scalable model-based reinforcement learning (MBRL) framework designed to address sample inefficiency in single-task settings and poor generalization in multi-task domains. Existing model-based approaches such as DreamerV3 rely on pixel-level generative models that neglect control-centric representations, while model-free methods such as PPO suffer from high sample complexity and weak exploration. M3PO integrates an implicit world model, trained to predict task outcomes without observation reconstruction, with a hybrid exploration strategy that combines model-based planning and model-free uncertainty-driven bonuses. This design sidesteps the bias-variance trade-off of prior methods: discrepancies between model-based and model-free value estimates guide exploration, while a trust-region optimizer keeps policy updates stable. M3PO provides an efficient and robust alternative to existing model-based policy optimization approaches and achieves state-of-the-art performance across multiple benchmarks.
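The abstract's exploration idea (using the disagreement between model-based and model-free value estimates as an uncertainty bonus added to the reward) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the absolute-difference form of the bonus, and the coefficient `beta` are assumptions for the sake of the example.

```python
import numpy as np

def exploration_bonus(v_model, v_free, beta=0.1):
    """Uncertainty bonus proportional to the disagreement between
    model-based and model-free value estimates (hypothetical form;
    the paper's exact formulation may differ)."""
    return beta * np.abs(np.asarray(v_model, dtype=float)
                         - np.asarray(v_free, dtype=float))

def shaped_rewards(rewards, v_model, v_free, beta=0.1):
    """Add the disagreement bonus to the environment reward, so states
    where the two estimators disagree are explored more."""
    return np.asarray(rewards, dtype=float) + exploration_bonus(
        v_model, v_free, beta)

# Example: larger disagreement -> larger bonus -> stronger incentive
# to visit that state again.
r = shaped_rewards([1.0, 1.0],
                   v_model=[5.0, 2.0],
                   v_free=[4.0, -2.0],
                   beta=0.5)
# r is [1.5, 3.0]: the second state gets a bigger bonus.
```

Intuitively, where the learned world model and the model-free critic agree, the bonus vanishes and the agent exploits; where they disagree, the model is likely biased or under-trained there, so the agent is rewarded for gathering more data in that region.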