🤖 AI Summary
This work addresses two problems that plague existing vision-language model (VLM)-based reward mechanisms in diffusion-model preference optimization: high computational cost and the misalignment between pixel-space rewards and latent generative spaces. To overcome these limitations, we propose DiNa-LRM, the first latent reward model natively aligned with the diffusion process, performing preference learning directly on noisy latent states. Key innovations include a noise-calibrated Thurstone likelihood that captures preference uncertainty, a timestep-conditioned reward head, and an inference-time noise-ensemble mechanism. Experiments demonstrate that DiNa-LRM significantly outperforms existing diffusion-based reward methods on image-alignment benchmarks, achieving performance comparable to state-of-the-art VLMs while substantially reducing computational overhead and accelerating preference optimization.
📝 Abstract
Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward providers, leveraging their rich multimodal priors to guide alignment. However, their computation and memory costs can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image-alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. We further demonstrate that DiNa-LRM improves preference-optimization dynamics, enabling faster and more resource-efficient model alignment.
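To make the two core ideas concrete, the sketch below illustrates (i) a Thurstone preference likelihood whose comparison uncertainty grows with the diffusion noise level, and (ii) inference-time noise ensembling that averages rewards over several noise draws. This is a minimal illustration under assumed forms: the linear `noise_calibrated_sigma` schedule, the perturbation scale, and the toy `reward_fn` are all hypothetical stand-ins, not the paper's actual parameterization.

```python
import math
import random

def thurstone_pref_prob(r_a, r_b, sigma_t):
    """P(A preferred over B) under a Thurstone model: the reward gap is
    passed through a Gaussian CDF with noise-dependent scale sigma_t."""
    z = (r_a - r_b) / (math.sqrt(2.0) * sigma_t)
    return 0.5 * (1.0 + math.erf(z))

def noise_calibrated_sigma(t, sigma_min=0.1, sigma_max=1.0, t_max=1000):
    # Hypothetical schedule: comparison uncertainty grows with the diffusion
    # timestep, since noisier latents make preferences harder to discern.
    return sigma_min + (sigma_max - sigma_min) * (t / t_max)

def ensemble_reward(reward_fn, latent, timesteps, n_noise=4, seed=0):
    """Average a timestep-conditioned reward over multiple noise draws,
    a stand-in for the paper's inference-time noise ensembling."""
    rng = random.Random(seed)
    scores = []
    for t in timesteps:
        for _ in range(n_noise):
            eps = rng.gauss(0.0, 1.0)          # surrogate noise draw
            scores.append(reward_fn(latent + 0.01 * eps, t))
    return sum(scores) / len(scores)
```

The noise calibration means the same reward gap yields a preference probability closer to 0.5 at high noise levels, so early (noisy) timesteps contribute softer training signals than late (clean) ones.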