🤖 AI Summary
Diffusion models for text-to-image generation suffer from “reward hacking”: during inference-time noise optimization, scalar reward scores (e.g., aesthetic quality) improve at the expense of textual fidelity. To address this, we propose MIRA—the first score-function-driven KL proxy regularization method operating directly in image space. By freezing the backbone and optimizing only the noise, MIRA constrains distributional shifts along the denoising trajectory, directly mitigating reward hacking. We further extend it to MIRA-DPO, enabling optimization of non-differentiable preference-based rewards. This training-free, inference-time alignment framework achieves >60% win rates across diverse reward metrics on both Stable Diffusion v1.5 and SDXL—significantly outperforming baselines—while inducing negligible distributional drift. MIRA is the first method to jointly achieve high reward scores and strong prompt alignment.
📝 Abstract
Diffusion models excel at generating images conditioned on text prompts, but the resulting images often do not satisfy user-specific criteria measured by scalar rewards such as Aesthetic Scores. This alignment typically requires fine-tuning, which is computationally demanding. Recently, inference-time alignment via noise optimization has emerged as an efficient alternative, modifying the initial input noise to steer the diffusion denoising process towards generating high-reward images. However, this approach suffers from reward hacking, where the model produces images that score highly yet deviate significantly from the original prompt. We show that noise-space regularization is insufficient and that preventing reward hacking requires an explicit image-space constraint. To this end, we propose MIRA (MItigating Reward hAcking), a training-free, inference-time alignment method. MIRA introduces an image-space, score-based KL surrogate that regularizes the sampling trajectory with a frozen backbone, constraining the output distribution so the reward can increase without off-distribution drift (reward hacking). We derive a tractable approximation to the KL divergence using diffusion scores. Across SDv1.5 and SDXL, multiple rewards (Aesthetic, HPSv2, PickScore), and public datasets (e.g., Animal-Animal, HPDv2), MIRA achieves a >60% win rate vs. strong baselines while preserving prompt adherence; mechanism plots show reward gains with near-zero drift, whereas DNO drifts as compute increases. We further introduce MIRA-DPO, mapping preference optimization to inference time with a frozen backbone, extending MIRA to non-differentiable rewards without fine-tuning.
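To make the core mechanism concrete, here is a minimal toy sketch of inference-time noise optimization with an image-space regularizer. Every component is a stand-in assumption (a linear "denoiser", a one-pixel reward, and a simple squared-distance penalty in image space serving as a proxy for the KL term), not the paper's actual models or its score-based surrogate; the point is only the structure: the backbone stays frozen, only the noise is optimized, and the image-space penalty lets the reward rise without drifting far from the unregularized output.

```python
import numpy as np

rng = np.random.default_rng(0)
# Frozen toy "denoiser": a fixed linear map standing in for the full
# sampling trajectory of a diffusion backbone (hypothetical stand-in).
W = np.eye(4) + 0.1 * rng.normal(size=(4, 4))

def denoise(z):
    return W @ z

def reward(x):
    return float(x[0])  # toy scalar reward: brightness of one "pixel"

def reward_grad(x):
    g = np.zeros_like(x)
    g[0] = 1.0
    return g

z = rng.normal(size=4)       # initial input noise (the only thing optimized)
x_ref = denoise(z)           # reference output before any optimization
lam, lr = 0.5, 0.1           # image-space penalty weight, step size

for _ in range(100):
    x = denoise(z)
    # Ascend reward(x) - lam * ||x - x_ref||^2; the chain rule pulls the
    # image-space gradient back to noise space through the frozen denoiser.
    gx = reward_grad(x) - 2.0 * lam * (x - x_ref)
    z = z + lr * (W.T @ gx)

x_final = denoise(z)
# Reward improves while the output stays near the reference distribution:
# without the lam term, x could drift arbitrarily far to chase the reward.
```

With the penalty active, the optimum of this toy objective sits a bounded distance from `x_ref` (here about one unit along the rewarded coordinate), illustrating how an image-space constraint caps drift while still permitting reward gains.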