🤖 AI Summary
In offline reinforcement learning, energy-guided diffusion policy generation suffers from the intractability of intermediate energy evaluation, stemming from the difficulty of estimating the log-expectation objective during sampling. To address this, we propose Analytic Energy-guided Policy Optimization (AEPO): the first method to derive a closed-form solution for intermediate energy guidance within a conditional Gaussian diffusion framework. AEPO establishes a theoretically grounded estimator for the log-expectation objective and introduces a trainable intermediate energy network. By unifying conditional diffusion modeling, energy-based guidance, and Gaussian process analysis, AEPO eliminates the need for Monte Carlo approximation of the log-expectation. Evaluated on over 30 D4RL benchmark tasks, AEPO consistently outperforms state-of-the-art offline RL baselines, achieving significant improvements in both policy performance and training stability.
📝 Abstract
Conditional decision generation with diffusion models has shown strong competitiveness in reinforcement learning (RL). Recent studies reveal the relationship between energy-function-guided diffusion models and constrained RL problems. The main challenge lies in estimating the intermediate energy, which is intractable due to the log-expectation formulation that arises during the generation process. To address this issue, we propose Analytic Energy-guided Policy Optimization (AEPO). Specifically, we first provide a theoretical analysis and a closed-form solution for the intermediate guidance when the diffusion model follows a conditional Gaussian transformation. Then, we analyze the posterior Gaussian distribution in the log-expectation formulation and derive a target estimate of the log-expectation under mild assumptions. Finally, we train an intermediate energy neural network to approximate this target estimate. We evaluate AEPO on more than 30 offline RL tasks. Extensive experiments illustrate that our method surpasses numerous representative baselines on the D4RL offline reinforcement learning benchmark.
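To make the core difficulty concrete, here is a minimal toy sketch (not the paper's model) of why the intermediate energy is hard in general but tractable under Gaussian assumptions. We assume a hypothetical setup where the denoising posterior p(x₀ | xₜ) is Gaussian N(m, s²) and the terminal energy is E(x₀) = x₀²/2; the intermediate energy is then the log-expectation −log E[exp(−E(x₀))], which generically needs Monte Carlo sampling but here admits a closed form, analogous in spirit to the analytic treatment AEPO derives:

```python
import numpy as np

# Toy illustration (hypothetical setup, not AEPO's exact model):
# denoising posterior p(x0 | xt) = N(m, s2), terminal energy E(x0) = x0^2 / 2.
# The intermediate energy is the log-expectation
#   E_t(xt) = -log E_{x0 ~ p(x0|xt)}[exp(-E(x0))],
# which for this Gaussian/quadratic pair can be integrated analytically.

def intermediate_energy_closed_form(m, s2):
    """Analytic value of -log E[exp(-x0^2/2)] under N(m, s2)."""
    # Gaussian integral gives E[exp(-x0^2/2)] = exp(-m^2 / (2(1+s2))) / sqrt(1+s2)
    return 0.5 * np.log1p(s2) + m**2 / (2.0 * (1.0 + s2))

def intermediate_energy_monte_carlo(m, s2, n=200_000, seed=0):
    """Monte Carlo estimate of the same log-expectation (the intractable route)."""
    rng = np.random.default_rng(seed)
    x0 = rng.normal(m, np.sqrt(s2), size=n)
    return -np.log(np.mean(np.exp(-0.5 * x0**2)))

m, s2 = 0.7, 0.4
analytic = intermediate_energy_closed_form(m, s2)
mc = intermediate_energy_monte_carlo(m, s2)
print(analytic, mc)  # the two estimates should agree closely
```

In a guided sampler, the gradient of this intermediate energy with respect to xₜ would steer each reverse-diffusion step; having it in closed form removes the Monte Carlo estimation that the abstract identifies as the main obstacle.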