DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions

📅 2025-09-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing diffusion-based world models typically do not generate actions directly in offline RL, which makes them incompatible with standard one-step temporal-difference (TD) algorithms such as TD3BC and IQL; jointly modeling state–reward–action triples, in turn, tends to increase training complexity and degrade performance. To address this, we propose DAWM, a modular diffusion world model that first uses a conditional diffusion model to synthesize future state–reward trajectories, then applies an inverse dynamics model (IDM) to infer the corresponding actions, yielding complete synthetic transition sequences. This design decouples trajectory generation from action inference, balancing modeling fidelity, training efficiency, and TD compatibility. On multiple tasks in the D4RL benchmark, TD3BC and IQL trained on DAWM-augmented trajectories consistently outperform prior diffusion-based baselines.

📝 Abstract

Diffusion-based world models have demonstrated strong capabilities in synthesizing realistic long-horizon trajectories for offline reinforcement learning (RL). However, many existing methods do not directly generate actions alongside states and rewards, limiting their compatibility with standard value-based offline RL algorithms that rely on one-step temporal difference (TD) learning. While prior work has explored joint modeling of states, rewards, and actions to address this issue, such formulations often lead to increased training complexity and reduced performance in practice. We propose DAWM, a diffusion-based world model that generates future state-reward trajectories conditioned on the current state, action, and return-to-go, paired with an inverse dynamics model (IDM) for efficient action inference. This modular design produces complete synthetic transitions suitable for one-step TD-based offline RL, enabling effective and computationally efficient training. Empirically, we show that conservative offline RL algorithms such as TD3BC and IQL benefit significantly from training on these augmented trajectories, consistently outperforming prior diffusion-based baselines across multiple tasks in the D4RL benchmark.

Problem

Research questions and friction points this paper is trying to address.

Generating synthetic transitions compatible with TD-based offline RL

Addressing training complexity from joint state-reward-action modeling

Enhancing offline RL performance via diffusion-generated trajectories
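To illustrate why TD compatibility depends on actions, here is a minimal sketch of a one-step TD error on a toy tabular Q-function. The names `td_error` and `q_table` are hypothetical, and the max-based bootstrap is a generic Q-learning stand-in; TD3BC and IQL use their own bootstrap targets. The point is only that the update indexes Q(s, a), so a trajectory of states and rewards alone is not enough.

```python
import numpy as np

def td_error(q_table, s, a, r, s_next, gamma=0.99):
    """One-step TD error for a single transition (s, a, r, s_next).

    Assumption: a tabular Q-function with a greedy (max) bootstrap,
    used here purely for illustration.
    """
    target = r + gamma * np.max(q_table[s_next])
    # The action `a` is required to look up the value being updated.
    return target - q_table[s, a]

# Toy example: 3 states, 2 actions.
q_table = np.zeros((3, 2))
q_table[2] = [1.0, 2.0]
err = td_error(q_table, s=0, a=1, r=0.5, s_next=2, gamma=0.9)
```

Without the action index `a`, the lookup `q_table[s, a]` cannot be formed, which is the incompatibility the first item above describes.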

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates state-reward trajectories using diffusion models

Infers actions with a separate inverse dynamics model

Produces synthetic transitions for TD-based offline RL
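The three steps above can be sketched roughly as follows. This is not the authors' implementation: the state-reward trajectory is hard-coded in place of a diffusion sample, `toy_idm` is a hypothetical stand-in for a learned inverse dynamics model (it assumes toy dynamics s' = s + a), and every action is inferred from state pairs, whereas the paper conditions generation on the current action.

```python
import numpy as np

def infer_actions(states, idm):
    """Apply an inverse dynamics model to each consecutive state pair."""
    return np.stack([idm(s, s_next) for s, s_next in zip(states[:-1], states[1:])])

def build_transitions(states, rewards, idm):
    """Turn a state-reward trajectory into (s, a, r, s') tuples.

    states:  array of shape (T + 1, state_dim)
    rewards: array of shape (T,)
    """
    actions = infer_actions(states, idm)
    return [
        (states[t], actions[t], rewards[t], states[t + 1])
        for t in range(len(actions))
    ]

# Hypothetical stand-in IDM: under toy dynamics s' = s + a, a = s' - s.
def toy_idm(s, s_next):
    return s_next - s

# Placeholder for a diffusion-generated state-reward trajectory.
states = np.array([[0.0, 0.0], [1.0, 0.5], [1.5, 1.5]])
rewards = np.array([0.1, 0.2])

transitions = build_transitions(states, rewards, toy_idm)
```

Each resulting tuple has the (s, a, r, s') form that a one-step TD learner consumes, so the synthetic data can be mixed into an offline replay buffer alongside real transitions.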

🔎 Similar Papers

State-Constrained Offline Reinforcement Learning