🤖 AI Summary
Aligning few-step generative models remains challenging because existing alignment frameworks rely on tractable likelihoods, specific ODE/SDE solvers, or particular model families. This work proposes FAV, a general-purpose alignment framework that recasts alignment as sampling from a reward-tilted distribution anchored to a reference distribution. It requires only sample access to the generator and the reference distribution and imposes no structural restrictions on the model. FAV uses Stein Variational Gradient Descent as a sample-based variational inference scheme and amortizes its particle updates into the generator parameters via fixed-point regression. On generative policy alignment for robotic manipulation, it outperforms prevailing policy extraction baselines across 56 offline and 30 offline-to-online reinforcement learning tasks. For image generation, it fine-tunes diverse few-step backbones, including a GAN, a drifting model, consistency models, and flow maps, at scales ranging from ImageNet-256 to 1024² text-to-image synthesis.
📝 Abstract
Aligning a few-step generative model is challenging because existing alignment frameworks typically rely on restrictive assumptions: a tractable likelihood, a specific ODE/SDE solver, or a particular model family. We introduce FAV (Few-step Generative Models Alignment via Sample-based Variational Inference), a general alignment framework that requires only sample access to the generator and the reference distribution. We cast alignment as sampling from a reward-tilted distribution anchored to a reference distribution. We use Stein Variational Gradient Descent as a sample-based variational inference scheme and amortize its particle updates into the generator parameters via fixed-point regression. We evaluate FAV in two domains: robotic manipulation and image generator alignment. On generative policy alignment for robotic manipulation, FAV outperforms prevailing policy extraction baselines across 56 offline and 30 offline-to-online RL tasks. For image generator alignment, FAV fine-tunes diverse few-step backbones, including a GAN, a drifting model, consistency models, and flow maps, scaling from ImageNet-256 to $1024^2$ text-to-image synthesis. Code is available at https://github.com/Jaewoopudding/FAV.
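As a rough illustration of the idea of amortizing Stein Variational Gradient Descent (SVGD) particle updates into a generator, the following toy sketch may help. It is not the FAV implementation: it assumes a standard normal reference whose score is known in closed form (FAV itself needs only samples), a hypothetical quadratic reward, a fixed-bandwidth RBF kernel, and an affine one-step generator. The regression step is one plausible reading of "fixed-point regression", namely a squared-error fit to SVGD-updated samples that are held constant. All function and parameter names are illustrative.

```python
import numpy as np

def svgd_direction(x, score_fn, bandwidth=1.0):
    """SVGD update direction for particles x of shape (n, d), RBF kernel."""
    diff = x[:, None, :] - x[None, :, :]             # diff[i, j] = x_i - x_j
    k = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * bandwidth ** 2))
    attract = k @ score_fn(x)                        # sum_j k_ij * score(x_j)
    repulse = np.sum(k[:, :, None] * diff, axis=1) / bandwidth ** 2
    return (attract + repulse) / x.shape[0]

def tilted_score(x, target, beta):
    """Score of p(x) ~ N(x; 0, I) * exp(r(x) / beta).

    Assumption for this sketch: r(x) = -0.5 * ||x - target||^2.
    """
    return -x - (x - target) / beta

def train_affine_generator(target, beta=1.0, steps=1000, batch=128,
                           step_size=0.1, lr=0.5, seed=0):
    """Fit g(z) = W z + b by regressing onto SVGD-updated samples."""
    rng = np.random.default_rng(seed)
    dim = target.shape[0]
    W, b = np.eye(dim), np.zeros(dim)
    for _ in range(steps):
        z = rng.standard_normal((batch, dim))
        x = z @ W.T + b                              # one-step generator samples
        phi = svgd_direction(x, lambda p: tilted_score(p, target, beta))
        y = x + step_size * phi                      # regression target, held constant
        resid = x - y
        # One gradient step on 0.5 * mean ||g(z) - y||^2 w.r.t. W and b.
        W -= lr * (resid.T @ z) / batch
        b -= lr * resid.mean(axis=0)
    return W, b

if __name__ == "__main__":
    target = np.array([2.0, -1.0])
    W, b = train_affine_generator(target, beta=1.0)
    # For this Gaussian toy problem the tilted mean is target / (1 + beta).
    print("learned offset:", b, "analytic tilted mean:", target / 2.0)
```

In this toy setting the reward-tilted distribution is itself Gaussian with mean `target / (1 + beta)`, so the learned offset `b` can be compared against a closed-form value. Real few-step generators are nonlinear networks, and the gradient step would be taken by an autodiff framework rather than written by hand.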