RubricRL: Simple Generalizable Rewards for Text-to-Image Generation

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of interpretability and flexibility in reward mechanisms for text-to-image (T2I) generation, this paper proposes RubricRL, a rubric-based reinforcement learning alignment framework. RubricRL dynamically constructs a fine-grained, structured visual rubric for each prompt (e.g., object correctness, attribute accuracy, and OCR fidelity), enabling modular reward design, user-customizable dimension weights, and cross-model generalization. A multimodal judge model (e.g., o4-mini) independently scores each rubric dimension, and a prompt-adaptive weighting mechanism integrates these scores into a composite reward signal, which is then optimized with policy-gradient algorithms such as GRPO or PPO. Experiments with an autoregressive T2I model demonstrate that RubricRL improves prompt faithfulness, visual detail, and generalizability without architectural modifications, and the framework is designed to extend across T2I architectures.
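The mechanism described above (per-criterion judge scores combined through prompt-adaptive weights) can be sketched in a few lines. The criterion names, weights, and the example prompt below are hypothetical illustrations, not values from the paper; the judge scores would in practice come from a multimodal judge model rather than be hard-coded.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str      # rubric dimension, e.g. "object correctness"
    weight: float  # prompt-adaptive weight for this dimension

def composite_reward(scores: dict[str, float], rubric: list[Criterion]) -> float:
    """Combine per-criterion judge scores (each in [0, 1]) into one scalar reward."""
    total_weight = sum(c.weight for c in rubric)
    return sum(c.weight * scores[c.name] for c in rubric) / total_weight

# Hypothetical rubric for the prompt "a red stop sign reading 'HALT'"
rubric = [
    Criterion("object correctness", 0.5),  # is a stop sign present?
    Criterion("attribute accuracy", 0.3),  # is it red?
    Criterion("ocr fidelity", 0.2),        # does the text read "HALT"?
]
# Scores a multimodal judge might return for one generated image
scores = {"object correctness": 1.0, "attribute accuracy": 1.0, "ocr fidelity": 0.5}
print(composite_reward(scores, rubric))  # 0.9
```

Because each dimension is scored independently, a user can re-weight or drop criteria (e.g., zeroing out `ocr fidelity` for prompts without text) without retraining the judge.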

📝 Abstract
Reinforcement learning (RL) has recently emerged as a promising approach for aligning text-to-image generative models with human preferences. A key challenge, however, lies in designing effective and interpretable rewards. Existing methods often rely on either composite metrics (e.g., CLIP, OCR, and realism scores) with fixed weights or a single scalar reward distilled from human preference models, which can limit interpretability and flexibility. We propose RubricRL, a simple and general framework for rubric-based reward design that offers greater interpretability, composability, and user control. Instead of using a black-box scalar signal, RubricRL dynamically constructs a structured rubric for each prompt--a decomposable checklist of fine-grained visual criteria such as object correctness, attribute accuracy, OCR fidelity, and realism--tailored to the input text. Each criterion is independently evaluated by a multimodal judge (e.g., o4-mini), and a prompt-adaptive weighting mechanism emphasizes the most relevant dimensions. This design not only produces interpretable and modular supervision signals for policy optimization (e.g., GRPO or PPO), but also enables users to directly adjust which aspects to reward or penalize. Experiments with an autoregressive text-to-image model demonstrate that RubricRL improves prompt faithfulness, visual detail, and generalizability, while offering a flexible and extensible foundation for interpretable RL alignment across text-to-image architectures.
Problem

Research questions and friction points this paper is trying to address.

Designing interpretable rewards for text-to-image model alignment
Replacing black-box scalar rewards with structured rubric criteria
Enabling user control over reward dimensions in RL optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

RubricRL uses a structured, per-prompt rubric for reward design
It employs a multimodal judge to evaluate each rubric criterion independently
A prompt-adaptive weighting mechanism emphasizes the most relevant visual dimensions