AI Summary
This work addresses the challenge of inefficient exploration in sparse-reward reinforcement learning by proposing a novel exploration framework based on Optimistic World Models (OWMs). It introduces, for the first time, Reward-Biased Maximum Likelihood Estimation (RBMLE), a classical technique from control theory, into deep reinforcement learning. The method injects optimism directly during model learning, encouraging the agent to imagine high-reward transition trajectories and thereby enabling efficient exploration. Its key innovation is a fully differentiable optimism mechanism that requires neither explicit uncertainty estimation nor constrained optimization; it only adds an optimistic dynamics loss to the standard training procedure, making it plug-and-play compatible with state-of-the-art world models such as DreamerV3 and STORM. Experiments demonstrate significant improvements in sample efficiency and cumulative reward across multiple benchmark environments, outperforming the original baselines.
Abstract
Efficient exploration remains a central challenge in reinforcement learning (RL), particularly in sparse-reward environments. We introduce Optimistic World Models (OWMs), a principled and scalable framework for optimistic exploration that brings classical reward-biased maximum likelihood estimation (RBMLE) from adaptive control into deep RL. In contrast to upper confidence bound (UCB)-style exploration methods, OWMs incorporate optimism directly into model learning by augmenting the training objective with an optimistic dynamics loss that biases imagined transitions toward higher-reward outcomes. This fully gradient-based loss requires neither uncertainty estimates nor constrained optimization. Our approach is plug-and-play with existing world model frameworks, preserving scalability while requiring only minimal modifications to standard training procedures. We instantiate OWMs within two state-of-the-art world model architectures, yielding Optimistic DreamerV3 and Optimistic STORM, which demonstrate significant improvements in sample efficiency and cumulative return compared to their baseline counterparts.
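The core mechanism (adding a reward-biased term to the world model's dynamics objective, in the spirit of RBMLE) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the use of mean squared error as a stand-in for the model's likelihood loss, and the optimism coefficient `alpha` are all assumptions for exposition.

```python
import numpy as np

def dynamics_loss(pred_next_state, true_next_state):
    """Standard world-model fit term (MSE as a stand-in for negative log-likelihood)."""
    return float(np.mean((pred_next_state - true_next_state) ** 2))

def optimistic_dynamics_loss(pred_next_state, true_next_state, pred_reward, alpha=0.1):
    """RBMLE-style objective: the usual fit term minus a reward bias.

    Subtracting alpha times the mean predicted reward tilts learning toward
    model parameters that imagine higher-reward transitions; alpha controls
    the degree of optimism (alpha = 0 recovers the standard objective).
    Fully differentiable: no uncertainty estimates or constraints needed.
    """
    return dynamics_loss(pred_next_state, true_next_state) - alpha * float(np.mean(pred_reward))
```

In a deep world model such as DreamerV3 or STORM, this would correspond to adding the reward-bias term to the existing model loss and training end-to-end by gradient descent, which is what makes the mechanism plug-and-play.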