Diffusion Alignment Beyond KL: Variance Minimisation as Effective Policy Optimiser

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of efficiently aligning pre-trained diffusion models so that their denoising trajectories sample from a reward-weighted target distribution. To this end, the authors formulate diffusion alignment as a sequential Monte Carlo process and propose, for the first time, minimizing the variance of log importance weights as the optimization objective, rather than the conventional KL divergence. This formulation provides a unified interpretation of existing alignment strategies and establishes a new design paradigm that transcends KL-based alignment. Theoretical analysis shows that the variance objective attains its minimum under the target distribution, its gradient coincides with that of KL alignment, and it naturally encompasses and generalizes several existing methods.

📝 Abstract
Diffusion alignment adapts pretrained diffusion models to sample from reward-tilted distributions along the denoising trajectory. This process naturally admits a Sequential Monte Carlo (SMC) interpretation, where the denoising model acts as a proposal and reward guidance induces importance weights. Motivated by this view, we introduce Variance Minimisation Policy Optimisation (VMPO), which formulates diffusion alignment as minimising the variance of log importance weights rather than directly optimising a Kullback-Leibler (KL) based objective. We prove that the variance objective is minimised by the reward-tilted target distribution and that, under on-policy sampling, its gradient coincides with that of standard KL-based alignment. This perspective offers a common lens for understanding diffusion alignment. Under different choices of potential functions and variance minimisation strategies, VMPO recovers various existing methods, while also suggesting new design directions beyond KL.
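The abstract's core claim, that the variance of log importance weights vanishes exactly when the proposal matches the reward-tilted target, can be illustrated on a 1D Gaussian toy problem. The sketch below is not the paper's method; it only mirrors the stated objective under simplifying assumptions: reference model `p_ref = N(0, 1)`, linear reward `r(x) = a*x` (so the tilted target is `N(a, 1)`), and a Gaussian proposal `q_mu = N(mu, 1)` standing in for the aligned denoiser. The variance of the log importance weight, estimated from on-policy samples, is minimised at `mu = a`.

```python
import numpy as np

rng = np.random.default_rng(0)
a = 1.5  # reward slope: r(x) = a * x, so the tilted target is N(a, 1)

def log_weight(x, mu):
    # log w = log p_ref(x) + r(x) - log q_mu(x), up to an additive constant,
    # with p_ref = N(0, 1) and proposal q_mu = N(mu, 1)
    return -0.5 * x**2 + a * x + 0.5 * (x - mu) ** 2

def vmpo_style_loss(mu, n=50_000):
    # On-policy samples from the proposal, as in the abstract's setting
    x = rng.normal(mu, 1.0, size=n)
    # Variance of log importance weights: zero iff q_mu equals the tilted target
    return np.var(log_weight(x, mu))

mus = np.linspace(-1.0, 3.0, 41)
losses = [vmpo_style_loss(mu) for mu in mus]
best = mus[int(np.argmin(losses))]
print(best)  # minimised at mu = a = 1.5, where log w is constant in x
```

Here the log weight reduces to `(a - mu) * x + mu**2 / 2`, so its variance under the proposal is `(a - mu)**2`: strictly positive unless `mu = a`, which is the minimiser property the paper proves in general.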
Problem

Research questions and friction points this paper is trying to address.

diffusion alignment
policy optimisation
reward-tilted distribution
variance minimisation
Sequential Monte Carlo
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion alignment
variance minimisation
policy optimisation
importance weighting
Sequential Monte Carlo