🤖 AI Summary
This work investigates how to align a pre-trained diffusion model with a preference reward at inference time while keeping its output distribution close to the original. The authors identify a distinct algorithmic primitive for each choice of distributional distance: sampling from linear exponential tilts when closeness is measured by KL divergence, and a proximal transport oracle when it is measured by Wasserstein distance. The study shows that the choice of distance determines both the computational primitive required and the class of rewards that can be handled tractably. Specifically, the KL framework aligns to a broad class of convex low-dimensional rewards, whereas the Wasserstein framework accommodates concave or low-dimensional Lipschitz rewards.
📝 Abstract
Inference-time reward alignment asks how to turn a pre-trained diffusion model with base law $p$ into a sampler that favors a reward $r$ while remaining close to $p$. Since there is no canonical distributional distance for this closeness constraint, different choices lead to different "reward-aligned" laws and, just as importantly, different algorithmic problems. We develop a primitive-based approach to reward alignment: rather than assuming arbitrary reward-aligned laws can be sampled, we ask which simple algorithmic primitives suffice to implement alignment for non-trivial reward classes. If closeness is measured in KL divergence, the target law is $q(x) \propto p(x) \exp(\lambda^{-1} r(x))$. For this setting, we show that linear exponential tilts of the form $q(x) \propto p(x) \exp(\langle \theta, x \rangle)$, which according to recent work [MRR26] can be efficiently sampled from, are a sufficient primitive for aligning to a very broad class of convex low-dimensional rewards. If closeness is measured in Wasserstein distance with transport cost $c$, the corresponding primitive is a proximal transport oracle: given $x$, solve $\operatorname{argmax}_y \{r(y) - \lambda c(x,y)\}$. This oracle can be efficiently implemented for concave or low-dimensional Lipschitz rewards $r(x)=f(Ax)$. Together, these results illustrate that the choice of distributional distance for alignment affects both the computational primitive and the tractable reward class.
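To make the proximal transport oracle concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a squared Euclidean cost $c(x,y)=\|x-y\|^2$ and a hypothetical concave reward $r(y)=-\tfrac12\|y-t\|^2$; the names `proximal_transport_oracle`, `reward_grad`, and `target` are illustrative and do not come from the paper. Under these assumptions the objective is strongly concave, so plain gradient ascent converges and the result can be checked against the closed-form maximizer $y^\star=(t+2\lambda x)/(1+2\lambda)$.

```python
from typing import Callable, List

Vector = List[float]


def proximal_transport_oracle(
    x: Vector,
    reward_grad: Callable[[Vector], Vector],
    lam: float,
    step_size: float = 0.1,
    num_steps: int = 500,
) -> Vector:
    """Approximate argmax_y { r(y) - lam * ||x - y||^2 } by gradient ascent.

    Assumption (not from the paper): the cost is squared Euclidean and r is
    concave and differentiable, so the objective is strongly concave. The
    step size must be small relative to the curvature of r plus 2 * lam.
    """
    y = list(x)  # start the search at the input point
    for _ in range(num_steps):
        g = reward_grad(y)
        # Gradient of the objective: grad r(y) - 2 * lam * (y - x).
        y = [
            yi + step_size * (gi - 2.0 * lam * (yi - xi))
            for yi, gi, xi in zip(y, g, x)
        ]
    return y


# Hypothetical concave reward r(y) = -0.5 * ||y - target||^2.
target: Vector = [1.0, -2.0]


def reward_grad(y: Vector) -> Vector:
    return [-(yi - ti) for yi, ti in zip(y, target)]


x0: Vector = [0.0, 0.0]
lam = 1.0

y_star = proximal_transport_oracle(x0, reward_grad, lam)

# Closed-form maximizer for this quadratic reward, used as a sanity check.
closed_form = [(ti + 2.0 * lam * xi) / (1.0 + 2.0 * lam) for ti, xi in zip(target, x0)]
```

Larger values of `lam` keep the output closer to the input `x0`, while smaller values let it move further toward the reward maximizer `target`; this mirrors the trade-off between reward and closeness to the base law described above.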