🤖 AI Summary
This work addresses two problems that plague existing vision-language model (VLM)-based reward mechanisms in diffusion-model preference optimization: high computational cost and the misalignment between pixel-space rewards and latent generative spaces. To overcome these limitations, we propose DiNa-LRM, the first latent reward model natively aligned with the diffusion process, performing preference learning directly on noisy latent states. Key innovations include a noise-calibrated Thurstone likelihood that captures preference uncertainty, a timestep-conditioned reward head, and an inference-time noise-ensemble mechanism. Experiments demonstrate that DiNa-LRM significantly outperforms existing diffusion-based reward methods on image-alignment benchmarks, achieving performance comparable to state-of-the-art VLMs while substantially reducing computational overhead and accelerating preference optimization.
📝 Abstract
Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward providers, leveraging their rich multimodal priors to guide alignment. However, their computation and memory costs can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image-alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. We further demonstrate that DiNa-LRM improves preference-optimization dynamics, enabling faster and more resource-efficient model alignment.
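To make the two core ideas concrete, the sketch below illustrates (i) a Thurstone preference likelihood whose comparison uncertainty grows with the diffusion noise level, and (ii) inference-time noise ensembling that averages rewards over several noise draws. This is a minimal illustration under assumed forms: the linear `noise_calibrated_sigma` schedule, the perturbation scale, and the toy `reward_fn` are all hypothetical stand-ins, not the paper's actual parameterization.

```python
import math
import random

def thurstone_pref_prob(r_a, r_b, sigma_t):
    """P(A preferred over B) under a Thurstone model: the reward gap is
    passed through a Gaussian CDF with noise-dependent scale sigma_t."""
    z = (r_a - r_b) / (math.sqrt(2.0) * sigma_t)
    return 0.5 * (1.0 + math.erf(z))

def noise_calibrated_sigma(t, sigma_min=0.1, sigma_max=1.0, t_max=1000):
    # Hypothetical schedule: comparison uncertainty grows with the diffusion
    # timestep, since noisier latents make preferences harder to discern.
    return sigma_min + (sigma_max - sigma_min) * (t / t_max)

def ensemble_reward(reward_fn, latent, timesteps, n_noise=4, seed=0):
    """Average a timestep-conditioned reward over multiple noise draws,
    a stand-in for the paper's inference-time noise ensembling."""
    rng = random.Random(seed)
    scores = []
    for t in timesteps:
        for _ in range(n_noise):
            eps = rng.gauss(0.0, 1.0)          # surrogate noise draw
            scores.append(reward_fn(latent + 0.01 * eps, t))
    return sum(scores) / len(scores)
```

The noise calibration means the same reward gap yields a preference probability closer to 0.5 at high noise levels, so early (noisy) timesteps contribute softer training signals than late (clean) ones.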