MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency

📅 2025-10-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing text-to-image models exhibit limited alignment with user preferences despite high diversity, and mainstream reward models are applied only post-hoc for filtering—degrading diversity, introducing semantic distortions, and impairing training efficiency. Method: We propose the first end-to-end training framework that integrates multiple reward models (e.g., preference, image quality, semantic fidelity) as joint conditional guidance directly into the pretraining stage, optimizing the generative distribution rather than performing retrospective filtering. Contribution/Results: Our approach simultaneously enhances image quality, text–image alignment, and generation diversity while significantly accelerating convergence. It achieves state-of-the-art performance on authoritative benchmarks—including GenEval, PickAScore, ImageReward, and HPSv2—demonstrating both the effectiveness and generalizability of multi-reward collaborative modeling for user preference alignment.

Technology Category

Application Category

📝 Abstract
Current text-to-image generative models are trained on large uncurated datasets to enable diverse generation capabilities. However, this does not align well with user preferences. Recently, reward models have been specifically designed to perform post-hoc selection of generated images and align them to a reward, typically user preference. This discarding of informative data together with the optimizing for a single reward tend to harm diversity, semantic fidelity and efficiency. Instead of this post-processing, we propose to condition the model on multiple reward models during training to let the model learn user preferences directly. We show that this not only dramatically improves the visual quality of the generated images but it also significantly speeds up the training. Our proposed method, called MIRO, achieves state-of-the-art performances on the GenEval compositional benchmark and user-preference scores (PickAScore, ImageReward, HPSv2).
Problem

Research questions and friction points this paper is trying to address.

Aligning text-to-image generation with user preferences
Improving quality and efficiency without sacrificing diversity
Conditioning models on multiple rewards during training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditioning model on multiple reward models during training
Learning user preferences directly for image generation
Improving visual quality and accelerating training process