🤖 AI Summary
Diffusion models for text-to-image generation suffer from “reward hacking”: during inference-time noise optimization, scalar reward scores (e.g., aesthetic quality) improve at the expense of textual fidelity. To address this, we propose MIRA—the first score-function-driven KL proxy regularization method operating directly in image space. By freezing the backbone and optimizing only the noise, MIRA constrains distributional shifts along the denoising trajectory, directly mitigating reward hacking. We further extend it to MIRA-DPO, enabling optimization of non-differentiable preference-based rewards. This training-free, inference-time alignment framework achieves >60% win rates across diverse reward metrics on both Stable Diffusion v1.5 and SDXL—significantly outperforming baselines—while inducing negligible distributional drift. MIRA is the first method to jointly achieve high reward scores and strong prompt alignment.
📝 Abstract
Diffusion models excel at generating images conditioned on text prompts, but the resulting images often do not satisfy user-specific criteria measured by scalar rewards such as Aesthetic Scores. This alignment typically requires fine-tuning, which is computationally demanding. Recently, inference-time alignment via noise optimization has emerged as an efficient alternative, modifying the initial input noise to steer the diffusion denoising process towards generating high-reward images. However, this approach suffers from reward hacking, where the model produces images that score highly yet deviate significantly from the original prompt. We show that noise-space regularization is insufficient and that preventing reward hacking requires an explicit image-space constraint. To this end, we propose MIRA (MItigating Reward hAcking), a training-free, inference-time alignment method. MIRA introduces an image-space, score-based KL surrogate that regularizes the sampling trajectory with a frozen backbone, constraining the output distribution so the reward can increase without off-distribution drift (reward hacking). We derive a tractable approximation to the KL divergence using diffusion scores. Across SDv1.5 and SDXL, multiple rewards (Aesthetic, HPSv2, PickScore), and public datasets (e.g., Animal-Animal, HPDv2), MIRA achieves a >60% win rate vs. strong baselines while preserving prompt adherence; mechanism plots show reward gains with near-zero drift, whereas DNO drifts as compute increases. We further introduce MIRA-DPO, mapping preference optimization to inference time with a frozen backbone, extending MIRA to non-differentiable rewards without fine-tuning.
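To make the core mechanism concrete, here is a minimal toy sketch of inference-time noise optimization with an image-space regularizer. Every component is a stand-in assumption (a linear "denoiser", a one-pixel reward, and a simple squared-distance penalty in image space serving as a proxy for the KL term), not the paper's actual models or its score-based surrogate; the point is only the structure: the backbone stays frozen, only the noise is optimized, and the image-space penalty lets the reward rise without drifting far from the unregularized output.

```python
import numpy as np

rng = np.random.default_rng(0)
# Frozen toy "denoiser": a fixed linear map standing in for the full
# sampling trajectory of a diffusion backbone (hypothetical stand-in).
W = np.eye(4) + 0.1 * rng.normal(size=(4, 4))

def denoise(z):
    return W @ z

def reward(x):
    return float(x[0])  # toy scalar reward: brightness of one "pixel"

def reward_grad(x):
    g = np.zeros_like(x)
    g[0] = 1.0
    return g

z = rng.normal(size=4)       # initial input noise (the only thing optimized)
x_ref = denoise(z)           # reference output before any optimization
lam, lr = 0.5, 0.1           # image-space penalty weight, step size

for _ in range(100):
    x = denoise(z)
    # Ascend reward(x) - lam * ||x - x_ref||^2; the chain rule pulls the
    # image-space gradient back to noise space through the frozen denoiser.
    gx = reward_grad(x) - 2.0 * lam * (x - x_ref)
    z = z + lr * (W.T @ gx)

x_final = denoise(z)
# Reward improves while the output stays near the reference distribution:
# without the lam term, x could drift arbitrarily far to chase the reward.
```

With the penalty active, the optimum of this toy objective sits a bounded distance from `x_ref` (here about one unit along the rewarded coordinate), illustrating how an image-space constraint caps drift while still permitting reward gains.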