Advantage-Guided Diffusion for Model-Based Reinforcement Learning

📅 2026-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the compounding errors inherent in autoregressive world models for reinforcement learning and the limitations of existing diffusion-based guidance methods, which either discard value information (policy-only guidance) or become myopic under short planning horizons (reward-based guidance). To overcome these issues, the authors propose an advantage-guided diffusion mechanism that, for the first time, integrates the state-action advantage function into the reverse diffusion process. They introduce sigmoid and exponential advantage-guidance strategies that operate seamlessly within the PolyGRAD framework without altering its original diffusion training objective. The approach is theoretically shown to guarantee policy improvement and mitigate myopia. Empirical results on MuJoCo benchmarks demonstrate substantial improvements over strong baselines, including PolyGRAD, online Diffuser-style reward guidance, and PPO/TRPO, with up to a two-fold gain in both sample efficiency and final return.
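The summary contrasts two advantage-weighting schemes, sigmoid and exponential. The paper's exact formulas are not given here, so the following is a minimal sketch under the assumption that each scheme maps an advantage estimate to a sampling weight via a temperature-scaled sigmoid or exponential; the function names and the `temperature` parameter are illustrative, not from the paper.

```python
import numpy as np

def sigmoid_guidance_weight(advantage, temperature=1.0):
    # Bounded weight in (0, 1); saturates for large |advantage|,
    # so extreme advantage estimates cannot dominate sampling.
    return 1.0 / (1.0 + np.exp(-advantage / temperature))

def exponential_guidance_weight(advantage, temperature=1.0):
    # Unbounded weight; emphasizes high-advantage trajectories
    # more aggressively than the sigmoid form.
    return np.exp(advantage / temperature)

advantages = np.array([-2.0, 0.0, 2.0])
sig_w = sigmoid_guidance_weight(advantages)       # monotone, bounded in (0, 1)
exp_w = exponential_guidance_weight(advantages)   # monotone, unbounded
```

Both weights increase monotonically in the advantage, which is the property the paper's reweighted-sampling argument relies on; the sigmoid variant trades some emphasis on large advantages for numerical stability.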

📝 Abstract
Model-based reinforcement learning (MBRL) with autoregressive world models suffers from compounding errors, whereas diffusion world models mitigate this by generating trajectory segments jointly. However, existing diffusion guides are either policy-only, discarding value information, or reward-based, which becomes myopic when the diffusion horizon is short. We introduce Advantage-Guided Diffusion for MBRL (AGD-MBRL), which steers the reverse diffusion process using the agent's advantage estimates so that sampling concentrates on trajectories expected to yield higher long-term return beyond the generated window. We develop two guides: (i) Sigmoid Advantage Guidance (SAG) and (ii) Exponential Advantage Guidance (EAG). We prove that a diffusion model guided through SAG or EAG allows us to perform reweighted sampling of trajectories with weights increasing in state-action advantage, implying policy improvement under standard assumptions. Additionally, we show that the trajectories generated by AGD-MBRL follow an improved policy (that is, one with higher value) compared to an unguided diffusion model. AGD integrates seamlessly with PolyGRAD-style architectures by guiding the state components while leaving action generation policy-conditioned, and requires no change to the diffusion training objective. On MuJoCo control tasks (HalfCheetah, Hopper, Walker2D, and Reacher), AGD-MBRL improves sample efficiency and final return over PolyGRAD, an online Diffuser-style reward guide, and model-free baselines (PPO/TRPO), in some cases by a margin of 2x. These results show that advantage-aware guidance is a simple, effective remedy for short-horizon myopia in diffusion-model MBRL.
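The abstract describes steering the reverse diffusion process with advantage estimates while leaving action generation policy-conditioned. A generic classifier-guidance-style sketch of one such guided reverse step is below; the toy `denoise` function, the quadratic advantage surrogate, and the `guidance_scale` parameter are all illustrative assumptions, not the paper's actual networks or update rule.

```python
import numpy as np

rng = np.random.default_rng(0)

def advantage_gradient(states):
    # Placeholder for the gradient of a learned advantage critic w.r.t. states.
    # Toy quadratic advantage A(s) = -||s||^2 / 2, so the gradient is -s.
    return -states

def guided_reverse_step(x_t, t, denoise_mean, sigma_t, guidance_scale=0.1):
    """One classifier-guidance-style reverse diffusion step: shift the
    predicted posterior mean toward higher advantage before adding noise."""
    mu = denoise_mean(x_t, t)                                   # unguided mean
    mu_guided = mu + guidance_scale * sigma_t**2 * advantage_gradient(mu)
    noise = rng.standard_normal(x_t.shape) if t > 0 else 0.0    # no noise at t=0
    return mu_guided + sigma_t * noise

# Toy denoiser that simply shrinks states toward the origin.
denoise = lambda x, t: 0.9 * x

x = rng.standard_normal(4)          # noisy state trajectory segment
for t in reversed(range(5)):        # run the guided reverse chain
    x = guided_reverse_step(x, t, denoise, sigma_t=0.1 * (t + 1) / 5)
```

Scaling the guidance term by `sigma_t**2` follows the usual classifier-guidance convention of matching the step's noise variance, so guidance strength decays as sampling approaches the clean trajectory.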
Problem

Research questions and friction points this paper is trying to address.

model-based reinforcement learning
diffusion models
short-horizon myopia
advantage estimation
trajectory generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Advantage-Guided Diffusion
Model-Based Reinforcement Learning
Diffusion Models
Trajectory Optimization
Policy Improvement