🤖 AI Summary
This work addresses the high computational cost of Adjoint Matching (AM) in reward-driven fine-tuning of diffusion and flow models, which requires many function evaluations for stochastic trajectory simulation as well as a backward ODE simulation of the adjoint state. The authors reformulate the underlying stochastic optimal control (SOC) problem with a linear base drift and a correspondingly modified terminal cost, yielding Efficient Adjoint Matching (EAM). This reformulation removes both bottlenecks: training-time sampling can use a few-step deterministic ODE solver, and the adjoint state has a closed-form solution. On standard text-to-image reward fine-tuning benchmarks, EAM converges up to 4× faster than AM while matching or surpassing it on PickScore, ImageReward, and related metrics.
📝 Abstract
Reward fine-tuning has become a common approach for aligning pretrained diffusion and flow models with human preferences in text-to-image generation. Among reward-gradient-based methods, Adjoint Matching (AM) provides a principled formulation by casting reward fine-tuning as a stochastic optimal control (SOC) problem. However, AM incurs a substantial computational cost: it requires (i) stochastic simulation of full generative trajectories under memoryless dynamics, resulting in a large number of function evaluations, and (ii) backward ODE simulation of the adjoint state along each sampled trajectory. In this work, we observe that both bottlenecks are closely tied to the \textit{non-trivial base drift} inherited from the pretrained model. Motivated by this observation, we propose \textbf{Efficient Adjoint Matching (EAM)}, which substantially improves training efficiency by reformulating the SOC problem with a \textit{linear base drift} and a correspondingly modified \textit{terminal cost}. This reformulation removes both sources of inefficiency: it enables training-time sampling with a few-step deterministic ODE solver and yields a closed-form adjoint solution that eliminates backward adjoint simulation. On standard text-to-image reward fine-tuning benchmarks, EAM converges up to 4x faster than AM and matches or surpasses it across various metrics, including PickScore, ImageReward, HPSv2.1, CLIPScore, and Aesthetics.
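To illustrate why a linear base drift can make the adjoint available in closed form, the following is a minimal sketch under stated assumptions rather than the paper's exact formulation. The notation is assumed for illustration and does not appear in the abstract: $b$ is the base drift, $\sigma(t)$ the noise schedule, $u$ the control, $g$ the terminal cost, $a$ the adjoint state defined with respect to the base drift only, and $\kappa_t$ a scalar, time-dependent coefficient. The paper's actual linear drift and modified terminal cost may be parametrized differently.

```latex
% Generic SOC problem (assumed notation, for illustration only)
\min_{u}\; \mathbb{E}\!\left[\int_0^1 \tfrac{1}{2}\,\lVert u(X_t,t)\rVert^2\,dt + g(X_1)\right]
\quad \text{s.t.} \quad
dX_t = \bigl(b(X_t,t) + \sigma(t)\,u(X_t,t)\bigr)\,dt + \sigma(t)\,dB_t .

% Adjoint ODE along a sampled trajectory, driven by the base drift
\frac{d}{dt}\,a(t) = -\bigl(\nabla_x b(X_t,t)\bigr)^{\top} a(t),
\qquad a(1) = \nabla_x g(X_1).

% Assumed linear base drift: b(x,t) = \kappa_t x, so \nabla_x b = \kappa_t I
\frac{d}{dt}\,a(t) = -\kappa_t\,a(t)
\;\;\Longrightarrow\;\;
a(t) = \exp\!\left(\int_t^1 \kappa_s\,ds\right)\nabla_x g(X_1).
```

With a general base drift inherited from a pretrained network, $\nabla_x b$ depends on the trajectory and the adjoint ODE has to be integrated backward numerically. Under the assumed linear form, the Jacobian is a known scalar multiple of the identity, so the adjoint at every time is a rescaling of the terminal gradient and no backward simulation is needed.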