🤖 AI Summary
In offline reinforcement learning, policy optimization with diffusion models is hindered by the mismatch with the behavior policy and the unknown importance weights $w$. This paper proposes a self-guiding mechanism based on joint action-weight diffusion modeling, introducing the Self-Weighted Guidance (SWG) paradigm: a single diffusion model simultaneously generates actions and their corresponding importance weights, implicitly encoding the policy-guidance signal without auxiliary critic networks or explicit weight-estimation modules. Leveraging score-based sampling, SWG enables end-to-end policy improvement, performing on par with state-of-the-art methods across challenging D4RL benchmarks. A controlled toy experiment validates the consistency of the learned joint action-weight distribution, while ablation studies confirm the critical role of weight modeling and demonstrate scalability across weight formulations, environments, and dataset sizes.
📝 Abstract
Offline reinforcement learning (RL) recovers the optimal policy $\pi$ from historical observations of an agent. In practice, $\pi$ is modeled as a weighted version of the agent's behavior policy $\mu$, using a weight function $w$ that acts as a critic of the agent's behavior. Though recent approaches to offline RL based on diffusion models have exhibited promising results, computing the required scores is challenging due to their dependence on the unknown $w$. In this work, we alleviate this issue by constructing a diffusion over both the actions and the weights. With the proposed setting, the required scores are obtained directly from the diffusion model without learning extra networks. Our main conceptual contribution is a novel guidance method in which the guidance (a function of $w$) comes from the same diffusion model; our proposal is therefore termed Self-Weighted Guidance (SWG). We show that SWG generates samples from the desired distribution on toy examples and performs on par with state-of-the-art methods on D4RL's challenging environments, while maintaining a streamlined training pipeline. We further validate SWG through ablation studies on weight formulations and scalability.
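The core idea above — denoising actions and weights jointly so that the guidance signal can be read off the same model — can be illustrated with a minimal, hypothetical sketch. The `joint_score` stand-in, the Langevin-style sampler, and the specific coupling between the weight channel and the action channels are all assumptions for illustration, not the paper's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

ACTION_DIM, WEIGHT_DIM = 2, 1  # toy dimensions (assumed)

def joint_score(x, t):
    """Stand-in for a learned score network over the joint vector
    x = [action, weight]. Here it is simply the score of a standard
    Gaussian, -x; in SWG this network would be trained on data."""
    return -x

def swg_style_sample(n_steps=50, step_size=0.05, guidance_scale=1.0):
    """Annealed Langevin-style sampling over the joint [action, weight]
    vector. The action channels receive extra guidance derived from the
    weight channel of the SAME model's score (illustrative coupling),
    mirroring the idea that no separate critic network is needed."""
    x = rng.standard_normal(ACTION_DIM + WEIGHT_DIM)
    for k in range(n_steps):
        t = 1.0 - k / n_steps          # decreasing noise level (unused by the toy score)
        s = joint_score(x, t)
        guided = s.copy()
        # Self-weighted guidance: reuse the model's own weight-channel
        # score to steer the action channels (hypothetical form).
        guided[:ACTION_DIM] += guidance_scale * s[ACTION_DIM:].mean()
        x = x + step_size * guided + np.sqrt(2 * step_size) * rng.standard_normal(x.shape)
    return x[:ACTION_DIM], x[ACTION_DIM:]

action, weight = swg_style_sample()
print(action.shape, weight.shape)
```

The point of the sketch is structural: a single score function over the concatenated action-weight vector supplies both the base denoising direction and the guidance term, so no auxiliary network is queried during sampling.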