🤖 AI Summary
In diffusion-based alignment, reinforcement learning or direct gradient optimization often leads to reward over-optimization and mode collapse. To address this, we propose a variational Expectation-Maximization (EM) framework that models alignment as an alternating iterative process: the E-step performs test-time search to generate high-reward, diverse samples, while the M-step fine-tunes the model by optimizing a variational lower bound. This approach explicitly balances reward maximization and diversity preservation without compromising generation quality. It unifies treatment across both continuous (e.g., text-to-image) and discrete (e.g., DNA sequence design) generative tasks. Experiments demonstrate that our method significantly mitigates mode collapse across multiple downstream benchmarks, achieving a more robust trade-off between reward and diversity. The results validate both its effectiveness and broad generalizability.
📝 Abstract
Diffusion alignment aims to optimize diffusion models for downstream objectives. While existing methods based on reinforcement learning or direct backpropagation achieve considerable success in maximizing rewards, they often suffer from reward over-optimization and mode collapse. We introduce Diffusion Alignment as Variational Expectation-Maximization (DAV), a framework that formulates diffusion alignment as an iterative process alternating between two complementary phases: the E-step and the M-step. In the E-step, we employ test-time search to generate diverse and reward-aligned samples. In the M-step, we refine the diffusion model using the samples discovered by the E-step. We demonstrate that DAV can optimize reward while preserving diversity for both continuous and discrete tasks: text-to-image synthesis and DNA sequence design.
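The alternating E-step/M-step loop described above can be illustrated with a minimal toy sketch. Everything here is an assumption for illustration, not the paper's actual code: the "model" is a one-dimensional Gaussian rather than a diffusion model, the E-step is approximated as reward-tilted resampling of candidates, and the M-step is a maximum-likelihood refit standing in for optimizing the variational lower bound.

```python
import math
import random

class ToyGaussianModel:
    """Stand-in for a generative model: a Gaussian over scalars.
    (Hypothetical; a real DAV setup would use a diffusion model.)"""
    def __init__(self, mu=0.0, sigma=2.0):
        self.mu, self.sigma = mu, sigma

    def sample(self):
        return random.gauss(self.mu, self.sigma)

    def finetune(self, samples):
        # M-step surrogate: refit the model to the E-step samples
        # (stands in for maximizing a variational lower bound).
        self.mu = sum(samples) / len(samples)

def e_step(model, reward, n_candidates=64, temperature=0.5):
    # E-step surrogate: draw candidates from the current model, then
    # resample with reward-tilted weights so high-reward regions are
    # favored while retaining a spread of samples (diversity).
    xs = [model.sample() for _ in range(n_candidates)]
    ws = [math.exp(reward(x) / temperature) for x in xs]
    return random.choices(xs, weights=ws, k=n_candidates)

def dav_align(model, reward, n_rounds=10):
    # Alternate E-step search and M-step refinement.
    for _ in range(n_rounds):
        model.finetune(e_step(model, reward))
    return model

if __name__ == "__main__":
    random.seed(0)
    # Toy reward: prefer samples near 3.0.
    m = dav_align(ToyGaussianModel(), lambda x: -abs(x - 3.0))
    print(round(m.mu, 1))  # mean drifts toward the high-reward region
```

Because the M-step fits the model to reward-tilted samples from its own distribution (rather than greedily maximizing reward by gradient ascent), the toy loop shifts the model toward high-reward regions without collapsing its variance to a single point, mirroring the reward-diversity trade-off the framework targets.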