AI Summary
This work addresses two difficulties: inconsistency in successor state measure (SSM) modeling, and the integration of dynamic-programming principles into policy representation. We introduce diffusion models to SSM estimation for the first time and propose the Bellman Flow Constraint, a novel, explicit Bellman update mechanism imposed at each diffusion step. This constraint enforces that the intermediate diffusion distributions satisfy the policy-dependent Bellman equation, thereby ensuring theoretical consistency of the SSM estimate. The method operates offline, without environment interaction, making it suitable for offline reinforcement learning. It achieves significant improvements over existing SSM and diffusion-based baselines on policy evaluation and cross-task transfer, validated on standard benchmarks including D4RL and Offline RL Gym. The core contribution is a theoretically grounded framework coupling diffusion processes with dynamic programming, yielding a differentiable, falsifiable, Bellman-consistent SSM estimator.
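For reference, the standard Bellman flow equation that such a constraint enforces on the discounted state occupancy can be written as follows; the notation here is ours (a generic statement of the constraint, not an equation taken from the paper):

```latex
% Bellman flow equation for the discounted state occupancy d^\pi:
% \rho_0 is the initial state distribution, P the environment dynamics,
% \pi the policy, and \gamma the discount factor.
d^{\pi}(s') \;=\; (1-\gamma)\,\rho_0(s')
  \;+\; \gamma \int_{\mathcal{S}\times\mathcal{A}}
      P(s' \mid s, a)\,\pi(a \mid s)\,d^{\pi}(s)\,\mathrm{d}s\,\mathrm{d}a .
```

Imposing this fixed-point relation on the intermediate distributions of a diffusion model is what ties the generative process to dynamic programming.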
Abstract
Diffusion models have seen tremendous success as generative architectures. Recently, they have been shown to be effective at modelling policies for offline reinforcement learning and imitation learning. We explore using diffusion as a model class for the successor state measure (SSM) of a policy. We find that enforcing the Bellman flow constraints leads to a simple Bellman update on the diffusion step distribution.
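As a concrete toy illustration of the Bellman flow constraint referred to above, the sketch below solves for the discounted state occupancy of a small tabular MDP in closed form and checks that it satisfies the flow equation. The MDP, variable names, and discount value are all illustrative assumptions, not details from the paper; the paper's method enforces this constraint on diffusion-step distributions rather than solving a linear system.

```python
import numpy as np

# Hypothetical tabular MDP with 3 states; the policy is already folded
# into the state-to-state transition matrix P_pi (rows sum to 1).
rng = np.random.default_rng(0)
P_pi = rng.random((3, 3))
P_pi /= P_pi.sum(axis=1, keepdims=True)

rho0 = np.array([1.0, 0.0, 0.0])  # initial state distribution
gamma = 0.9                       # discount factor (illustrative)

# Bellman flow constraint: d = (1 - gamma) * rho0 + gamma * P_pi^T d.
# Its unique fixed point is d = (1 - gamma) * (I - gamma * P_pi^T)^{-1} rho0.
d = (1 - gamma) * np.linalg.solve(np.eye(3) - gamma * P_pi.T, rho0)

# Verify the flow constraint holds and that d is a probability distribution.
residual = d - ((1 - gamma) * rho0 + gamma * P_pi.T @ d)
print(np.abs(residual).max(), d.sum())
```

The residual is zero up to machine precision, and the occupancy sums to one automatically because each row of `P_pi` sums to one.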