Recovering Hidden Reward in Diffusion-Based Policies

📅 2026-05-01

🤖 AI Summary

This work proposes EnergyFlow, a framework for recovering implicit reward functions from diffusion policies without adversarial training. It parameterizes a scalar energy function whose gradient is the denoising field, which unifies denoising score matching and inverse reinforcement learning under maximum-entropy optimality. The authors show that, under this assumption, the score function learned by a diffusion model recovers the gradient of the expert's soft Q-function. Constraining the learned vector field to be conservative reduces hypothesis complexity and tightens out-of-distribution generalization bounds; the paper also characterizes when the recovered reward is identifiable and bounds how score estimation errors propagate to action preferences. Experiments report state-of-the-art imitation learning performance on multiple manipulation tasks, and the extracted rewards provide a downstream reinforcement learning signal that outperforms adversarial IRL and likelihood-based alternatives.
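
The link between the score and the soft Q-function can be sketched from the standard maximum-entropy policy form. The notation here is an assumption for illustration and may differ from the paper's: $\alpha$ is an entropy temperature, and $Q_{\mathrm{soft}}$ and $V_{\mathrm{soft}}$ are the expert's soft Q-function and soft value function.

```latex
% Maximum-entropy optimal policy (Boltzmann form, temperature \alpha assumed):
\pi^{*}(a \mid s) = \exp\!\left( \frac{Q_{\mathrm{soft}}(s,a) - V_{\mathrm{soft}}(s)}{\alpha} \right)

% V_soft does not depend on a, so the action score is
\nabla_{a} \log \pi^{*}(a \mid s) = \frac{1}{\alpha}\, \nabla_{a} Q_{\mathrm{soft}}(s,a)
```

Under this form, a model that estimates the action score of the expert policy estimates the gradient of the soft Q-function up to the factor $1/\alpha$.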

📝 Abstract

This paper introduces EnergyFlow, a framework that unifies generative action modeling with inverse reinforcement learning by parameterizing a scalar energy function whose gradient is the denoising field. We establish that under maximum-entropy optimality, the score function learned via denoising score matching recovers the gradient of the expert's soft Q-function, enabling reward extraction without adversarial training. Formally, we prove that constraining the learned field to be conservative reduces hypothesis complexity and tightens out-of-distribution generalization bounds. We further characterize the identifiability of recovered rewards and bound how score estimation errors propagate to action preferences. Empirically, EnergyFlow achieves state-of-the-art imitation performance on various manipulation tasks while providing an effective reward signal for downstream reinforcement learning that outperforms both adversarial IRL methods and likelihood-based alternatives. These results show that the structural constraints required for valid reward extraction simultaneously serve as beneficial inductive biases for policy generalization. The code is available at https://github.com/sotaagi/EnergyFlow.
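
As a rough illustration of the idea, and not the authors' implementation, the following one-dimensional sketch fits a score by denoising score matching and reads off a scalar energy from it. Everything in it is an assumption made for the example: the Gaussian toy "expert", the linear score model, the closed-form least-squares fit, the helper names `fit_linear_score` and `energy`, and the sign convention that the score is the negative derivative of the energy. In one dimension every field is trivially conservative, so the sketch shows only the score-to-energy step, not the effect of the conservativity constraint.

```python
import random


def fit_linear_score(actions, noise_std, rng):
    """Denoising score matching with a linear score model s(a) = w * a + b.

    Each action is perturbed with Gaussian noise; the regression target is
    the score of the perturbation kernel, -(noisy - clean) / noise_std**2,
    which equals -eps / noise_std. For a linear model the squared-error
    objective is minimized in closed form by ordinary least squares.
    """
    noisy, targets = [], []
    for a in actions:
        eps = rng.gauss(0.0, 1.0)
        noisy.append(a + noise_std * eps)
        targets.append(-eps / noise_std)
    n = len(noisy)
    mean_x = sum(noisy) / n
    mean_t = sum(targets) / n
    cov_xt = sum((x - mean_x) * (t - mean_t) for x, t in zip(noisy, targets)) / n
    var_x = sum((x - mean_x) ** 2 for x in noisy) / n
    w = cov_xt / var_x
    b = mean_t - w * mean_x
    return w, b


def energy(a, w, b):
    """Scalar energy whose negative derivative is the fitted score.

    Assumed sign convention: -dE/da = w * a + b, so low energy marks
    preferred actions and -E plays the role of a reward up to scale
    and an additive constant.
    """
    return -(0.5 * w * a * a + b * a)


# Toy expert: actions for one fixed state, drawn from a Gaussian (assumption).
rng = random.Random(0)
expert_mean, expert_std, noise_std = 0.3, 1.0, 0.5
actions = [rng.gauss(expert_mean, expert_std) for _ in range(20000)]

w, b = fit_linear_score(actions, noise_std, rng)
preferred_action = -b / w  # minimizer of the fitted energy
print("fitted slope:", w)
print("preferred action:", preferred_action)
```

In this toy setting the fitted slope should approach `-1 / (expert_std**2 + noise_std**2)`, the score slope of the noise-smoothed action distribution, and the energy minimum should sit near the expert's mean action. A multi-dimensional version would replace the linear model with a neural energy network and obtain the field by automatic differentiation, which is where the conservativity constraint becomes non-trivial.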

Problem

Research questions and friction points this paper is trying to address.

reward recovery

diffusion-based policies

inverse reinforcement learning

imitation learning

score matching

Innovation

Methods, ideas, or system contributions that make the work stand out.

Energy-based models

Diffusion policies

Inverse reinforcement learning