Optimistic World Models: Efficient Exploration in Model-Based Deep Reinforcement Learning

📅 2026-02-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of inefficient exploration in sparse-reward reinforcement learning by proposing a novel exploration framework based on Optimistic World Models (OWMs). It introduces, for the first time, Reward-Biased Maximum Likelihood Estimation (RBMLE), a classical technique from adaptive control, into deep reinforcement learning. The method injects optimism directly into model learning, encouraging the agent to imagine high-reward transition trajectories and thereby enabling efficient exploration. Its key innovation is a fully differentiable optimistic mechanism that requires neither explicit uncertainty estimation nor constrained optimization; it only adds an optimistic dynamics loss to the standard training procedure, making it plug-and-play compatible with state-of-the-art world models such as DreamerV3 and STORM. Experiments demonstrate significant improvements in sample efficiency and cumulative reward across multiple benchmark environments, outperforming the original baselines.
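The mechanism described above is concrete enough to sketch. Below is a minimal, hypothetical illustration of an RBMLE-style optimistic dynamics loss added to an otherwise standard world-model update; the `WorldModel` architecture, the `alpha` weight, and the use of plain MLPs over raw states are all our assumptions for illustration. The paper instantiates the idea inside the latent-state objectives of DreamerV3 and STORM, which this toy example does not reproduce.

```python
# Minimal sketch (not the paper's code): an RBMLE-style optimism term
# added to a generic one-step world-model update in PyTorch.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Toy world model: an MLP dynamics head and an MLP reward head."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )
        self.reward = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

def training_step(model, opt, batch, alpha=0.1):
    s, a, r, s_next = batch
    s_next_pred = model.dynamics(torch.cat([s, a], dim=-1))

    # Standard (MLE-like) world-model fitting: next-state and reward regression.
    dyn_loss = ((s_next_pred - s_next) ** 2).mean()
    rew_loss = ((model.reward(s_next) - r) ** 2).mean()

    # Optimistic dynamics term: score the *imagined* next state with the
    # learned reward head. The reward parameters are frozen for this forward
    # pass so the optimism gradient reaches only the dynamics network.
    for p in model.reward.parameters():
        p.requires_grad_(False)
    optimism = model.reward(s_next_pred).mean()
    for p in model.reward.parameters():
        p.requires_grad_(True)

    # Subtracting the optimism bonus biases the model toward imagining
    # high-reward transitions -- a purely gradient-based reward bias, with
    # no uncertainty estimate or constrained optimization.
    loss = dyn_loss + rew_loss - alpha * optimism
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss)

# Toy usage on random data.
model = WorldModel(state_dim=4, action_dim=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = (torch.randn(32, 4), torch.randn(32, 2),
         torch.randn(32, 1), torch.randn(32, 4))
training_step(model, opt, batch)
```

Because optimism enters only as an extra additive loss term, the same optimizer, replay buffer, and actor-critic machinery can be kept unchanged, which is what the summary means by plug-and-play.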

📝 Abstract
Efficient exploration remains a central challenge in reinforcement learning (RL), particularly in sparse-reward environments. We introduce Optimistic World Models (OWMs), a principled and scalable framework for optimistic exploration that brings classical reward-biased maximum likelihood estimation (RBMLE) from adaptive control into deep RL. In contrast to upper confidence bound (UCB)-style exploration methods, OWMs incorporate optimism directly into model learning by augmenting the training objective with an optimistic dynamics loss that biases imagined transitions toward higher-reward outcomes. This fully gradient-based loss requires neither uncertainty estimates nor constrained optimization. Our approach is plug-and-play with existing world model frameworks, preserving scalability while requiring only minimal modifications to standard training procedures. We instantiate OWMs within two state-of-the-art world model architectures, leading to Optimistic DreamerV3 and Optimistic STORM, which demonstrate significant improvements in sample efficiency and cumulative return compared to their baseline counterparts.
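For context, classical RBMLE in adaptive control selects parameters by maximizing a reward-biased likelihood rather than the likelihood alone. A schematic translation to the world-model loss described in the abstract might read as follows; the notation (bias weight alpha, dynamics loss L_dyn, reward head r̂_ψ, predicted next state ŝ'_θ) is ours, and the paper's exact optimistic term may differ:

```latex
% Classical RBMLE: bias maximum-likelihood estimation toward parameters
% whose optimal achievable reward J^*(\theta) is large.
\hat{\theta}_t \in \arg\max_{\theta} \Big[ \log \mathcal{L}_t(\theta) + \alpha_t \, J^*(\theta) \Big]

% Schematic world-model analogue (hypothetical notation): the usual
% dynamics loss, minus a bonus for imagined next states that the learned
% reward head scores highly.
\mathcal{L}_{\mathrm{OWM}}(\theta) = \mathcal{L}_{\mathrm{dyn}}(\theta)
  - \alpha \, \mathbb{E}_{(s,a) \sim \mathcal{D}}
    \big[ \hat{r}_{\psi}\big( \hat{s}'_{\theta}(s, a) \big) \big]
```

In classical RBMLE the bias weight grows sublinearly in time so the likelihood term eventually dominates; here, the additive and differentiable form of the bias is what lets it ride along with standard gradient-based training.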
Problem

Research questions and friction points this paper is trying to address.

efficient exploration
reinforcement learning
sparse-reward environments
model-based reinforcement learning
optimistic exploration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimistic World Models
Reward-Biased Maximum Likelihood Estimation
Model-Based Reinforcement Learning
Efficient Exploration
Gradient-Based Optimism
Authors

Akshay Mete
Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas, USA

Shahid Aamir Sheikh
Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas, USA

Tzu-Hsiang Lin
Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas, USA

Dileep Kalathil
Texas A&M University
Reinforcement Learning, Machine Learning, Stochastic Control

P. R. Kumar
Texas A&M University
Learning Theory, Wireless, Network Information Theory, Stochastic Control, Cyberphysical Systems