Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection

📅 2025-03-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Inference-time scaling for text-to-image diffusion models has largely been limited to best-of-N sampling, which needs many samples per prompt, while the alternative of training larger models is computationally expensive. This work proposes Reflect-DiT, an inference-time scaling method that equips Diffusion Transformers (DiTs) with *in-context reflection*: the model conditions each new generation on previously generated images together with textual feedback describing the improvements they need, and refines its output iteratively. Unlike passive best-of-N sampling, which relies on random variation between samples, this approach targets the specific defects identified in earlier attempts. With SANA-1.0-1.6B as the base model, Reflect-DiT improves the GenEval score by 0.19 and reaches a new state-of-the-art score of 0.81 using only 20 samples per prompt, surpassing the previous best of 0.80, which was obtained by SANA-1.5-4.8B with best-of-N over 2048 samples.

📝 Abstract

The predominant approach to advancing text-to-image generation has been training-time scaling, where larger models are trained on more data using greater computational resources. While effective, this approach is computationally expensive, leading to growing interest in inference-time scaling to improve performance. Currently, inference-time scaling for text-to-image diffusion models is largely limited to best-of-N sampling, where multiple images are generated per prompt and a selection model chooses the best output. Inspired by the recent success of reasoning models like DeepSeek-R1 in the language domain, we introduce an alternative to naive best-of-N sampling by equipping text-to-image Diffusion Transformers with in-context reflection capabilities. We propose Reflect-DiT, a method that enables Diffusion Transformers to refine their generations using in-context examples of previously generated images alongside textual feedback describing necessary improvements. Instead of passively relying on random sampling and hoping for a better result in a future generation, Reflect-DiT explicitly tailors its generations to address specific aspects requiring enhancement. Experimental results demonstrate that Reflect-DiT improves performance on the GenEval benchmark (+0.19) using SANA-1.0-1.6B as a base model. Additionally, it achieves a new state-of-the-art score of 0.81 on GenEval while generating only 20 samples per prompt, surpassing the previous best score of 0.80, which was obtained using a significantly larger model (SANA-1.5-4.8B) with 2048 samples under the best-of-N approach.
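
The contrast between best-of-N sampling and in-context reflection can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: `Attempt`, `best_of_n`, `reflect_loop`, `toy_generate`, `toy_judge`, and the `context_size` and `threshold` parameters are all hypothetical names introduced here, images are stood in for by strings, and the generator and judge are deterministic stubs rather than a diffusion model and a feedback model. It does not reproduce the reported GenEval numbers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    image: str      # stand-in for a generated image
    feedback: str   # textual feedback describing what to fix
    score: float    # quality score assigned by the judge


# Hypothetical interfaces: a generator that may look at past attempts,
# and a judge that returns (score, textual feedback).
Generator = Callable[[str, List[Attempt]], str]
Judge = Callable[[str, str], Tuple[float, str]]


def best_of_n(prompt: str, generate: Generator, judge: Judge, n: int) -> Attempt:
    """Draw n independent samples (empty context) and keep the best one."""
    best: Optional[Attempt] = None
    for _ in range(n):
        image = generate(prompt, [])
        score, feedback = judge(prompt, image)
        if best is None or score > best.score:
            best = Attempt(image, feedback, score)
    return best


def reflect_loop(prompt: str, generate: Generator, judge: Judge, n: int,
                 context_size: int = 3, threshold: float = 1.0) -> Tuple[Attempt, int]:
    """Condition each new sample on recent attempts and their feedback."""
    history: List[Attempt] = []
    best: Optional[Attempt] = None
    for _ in range(n):
        context = history[-context_size:]        # most recent in-context examples
        image = generate(prompt, context)
        score, feedback = judge(prompt, image)
        attempt = Attempt(image, feedback, score)
        history.append(attempt)
        if best is None or score > best.score:
            best = attempt
        if score >= threshold:                   # stop once the judge is satisfied
            break
    return best, len(history)


# Toy stand-ins: an "image" is a |-separated list of the objects it contains.
REQUIRED = ["red cube", "blue sphere", "green cone"]


def toy_judge(prompt: str, image: str) -> Tuple[float, str]:
    present = set(image.split("|"))
    missing = [obj for obj in REQUIRED if obj not in present]
    score = 1 - len(missing) / len(REQUIRED)
    feedback = f"missing: {missing[0]}" if missing else ""
    return score, feedback


def toy_generate(prompt: str, context: List[Attempt]) -> str:
    if not context:
        return REQUIRED[0]                       # an unconditioned sample gets one object right
    last = context[-1]
    items = last.image.split("|")
    if last.feedback.startswith("missing: "):
        items.append(last.feedback[len("missing: "):])  # fix the named defect
    return "|".join(items)


if __name__ == "__main__":
    prompt = "a red cube, a blue sphere and a green cone"
    print(best_of_n(prompt, toy_generate, toy_judge, n=20))
    print(reflect_loop(prompt, toy_generate, toy_judge, n=20))
```

In this toy setup the best-of-N baseline never improves because its samples ignore earlier attempts, whereas the reflection loop repairs one named defect per iteration and stops early. A real sampler is stochastic, so best-of-N does improve with N in practice; the sketch only illustrates where the feedback enters the loop.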

Problem

Research questions and friction points this paper is trying to address.

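Training-time scaling of text-to-image models is effective but computationally expensive.

Inference-time scaling for diffusion models is largely limited to best-of-N sampling, which relies on random sampling rather than targeted correction.

Best-of-N needs many samples per prompt: the previous best GenEval score of 0.80 used 2048 samples.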

Innovation

Methods, ideas, or system contributions that make the work stand out.

In-context reflection for Diffusion Transformers

Reflect-DiT refines images using textual feedback

State-of-the-art GenEval score of 0.81 with only 20 samples per prompt

🔎 Similar Papers

EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing