AI Summary
This work addresses the challenge of inefficient exploration in sparse-reward reinforcement learning by proposing a novel exploration framework based on Optimistic World Models (OWMs). It introduces, for the first time, Reward-Biased Maximum Likelihood Estimation (RBMLE), a classical technique from control theory, into deep reinforcement learning. The method injects optimism directly during model learning, encouraging the agent to imagine high-reward transition trajectories and thereby enabling efficient exploration. Its key innovation is a fully differentiable optimism mechanism that requires neither explicit uncertainty estimation nor constrained optimization; it only adds an optimistic dynamics loss to the standard training procedure, making it plug-and-play compatible with state-of-the-art world models such as DreamerV3 and STORM. Experiments demonstrate significant improvements in sample efficiency and cumulative reward across multiple benchmark environments, outperforming the original baselines.
Abstract
Efficient exploration remains a central challenge in reinforcement learning (RL), particularly in sparse-reward environments. We introduce Optimistic World Models (OWMs), a principled and scalable framework for optimistic exploration that brings classical reward-biased maximum likelihood estimation (RBMLE) from adaptive control into deep RL. In contrast to upper confidence bound (UCB)-style exploration methods, OWMs incorporate optimism directly into model learning by augmenting the training objective with an optimistic dynamics loss that biases imagined transitions toward higher-reward outcomes. This fully gradient-based loss requires neither uncertainty estimates nor constrained optimization. Our approach is plug-and-play with existing world model frameworks, preserving scalability while requiring only minimal modifications to standard training procedures. We instantiate OWMs within two state-of-the-art world model architectures, yielding Optimistic DreamerV3 and Optimistic STORM, which demonstrate significant improvements in sample efficiency and cumulative return compared to their baseline counterparts.
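The core mechanism (adding a reward-biased term to the world model's dynamics objective, in the spirit of RBMLE) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the use of mean squared error as a stand-in for the model's likelihood loss, and the optimism coefficient `alpha` are all assumptions for exposition.

```python
import numpy as np

def dynamics_loss(pred_next_state, true_next_state):
    """Standard world-model fit term (MSE as a stand-in for negative log-likelihood)."""
    return float(np.mean((pred_next_state - true_next_state) ** 2))

def optimistic_dynamics_loss(pred_next_state, true_next_state, pred_reward, alpha=0.1):
    """RBMLE-style objective: the usual fit term minus a reward bias.

    Subtracting alpha times the mean predicted reward tilts learning toward
    model parameters that imagine higher-reward transitions; alpha controls
    the degree of optimism (alpha = 0 recovers the standard objective).
    Fully differentiable: no uncertainty estimates or constraints needed.
    """
    return dynamics_loss(pred_next_state, true_next_state) - alpha * float(np.mean(pred_reward))
```

In a deep world model such as DreamerV3 or STORM, this would correspond to adding the reward-bias term to the existing model loss and training end-to-end by gradient descent, which is what makes the mechanism plug-and-play.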