🤖 AI Summary
This work addresses the longstanding trade-off between generation quality and sampling speed in text-to-image (T2I) synthesis, as well as the lack of a unified, efficient inference framework applicable across heterogeneous model architectures—specifically diffusion models (DMs) and visual autoregressive models (ARMs). We propose CoRe², a three-stage inference paradigm: (1) *Collect*, which gathers classifier-free guidance (CFG) trajectories; (2) *Reflect*, which trains a lightweight reflection model that halves the number of function evaluations; and (3) *Refine*, which employs weak-to-strong conditional guidance to enhance high-frequency details. CoRe² operates without modifying base models and is the first method to uniformly support SDXL, SD3.5, FLUX (DMs), and LlamaGen (a visual ARM). It introduces a novel "weak-model reflection + strong-model refinement" collaboration mechanism. Experiments on the HPD v2 and Pick-a-Pic benchmarks show consistent gains: PickScore ↑0.3, aesthetic score (AES) ↑0.16, and a 5.64 s per-image inference speed-up on SD3.5—while remaining compatible with Z-Sampling.
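The Collect stage described above gathers classifier-free guidance (CFG) trajectories so that a lightweight reflection model can later mimic them in a single forward pass. Below is a minimal, hedged sketch of that idea; the function names, the dummy `model` interface, and the exact data stored per step are illustrative assumptions, not the paper's implementation.

```python
def cfg_output(eps_cond, eps_uncond, guidance_scale):
    # Standard classifier-free guidance: extrapolate the conditional
    # prediction away from the unconditional one.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def collect_trajectory(model, latents_per_step, cond, uncond, guidance_scale):
    # Collect stage (sketch): at each denoising step, run both the
    # conditional and unconditional passes (2 function evaluations) and
    # store (input, CFG output) pairs. A reflection model trained on
    # these pairs can reproduce the CFG output in one pass, halving the
    # number of function evaluations at inference time.
    dataset = []
    for x_t in latents_per_step:
        eps_c = model(x_t, cond)    # conditional pass
        eps_u = model(x_t, uncond)  # unconditional pass
        dataset.append((x_t, cfg_output(eps_c, eps_u, guidance_scale)))
    return dataset
```

In a real pipeline `x_t` would be a latent tensor and `model` a DM's noise predictor (or an ARM's logit head); scalars are used here only to keep the sketch self-contained.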
📝 Abstract
Making text-to-image (T2I) generative models sample both quickly and well is a promising research direction. Previous studies have typically focused on either enhancing the visual quality of synthesized images at the expense of sampling efficiency or dramatically accelerating sampling without improving the base model's generative capacity. Moreover, nearly all inference methods fail to ensure stable performance simultaneously on both diffusion models (DMs) and visual autoregressive models (ARMs). In this paper, we introduce a novel plug-and-play inference paradigm, CoRe^2, which comprises three subprocesses: Collect, Reflect, and Refine. CoRe^2 first collects classifier-free guidance (CFG) trajectories, and then uses the collected data to train a weak model that reflects the easy-to-learn content while halving the number of function evaluations during inference. Subsequently, CoRe^2 employs weak-to-strong guidance to refine the conditional output, thereby improving the model's capacity to generate high-frequency and realistic content, which is difficult for the base model to capture. To the best of our knowledge, CoRe^2 is the first method to demonstrate both efficiency and effectiveness across a wide range of DMs, including SDXL, SD3.5, and FLUX, as well as ARMs such as LlamaGen. It exhibits significant performance improvements on HPD v2, Pick-a-Pic, DrawBench, GenEval, and T2I-CompBench. Furthermore, CoRe^2 can be seamlessly integrated with the state-of-the-art Z-Sampling, outperforming it by 0.3 on PickScore and 0.16 on AES while saving 5.64 s per image with SD3.5. Code is released at https://github.com/xie-lab-ml/CoRe/tree/main.
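The Refine stage uses weak-to-strong guidance: the weak reflection model's output serves as a negative anchor from which the strong base model's conditional output is extrapolated, amplifying detail the weak model failed to capture. A minimal sketch of that extrapolation follows; the function name, the linear form, and the scale parameter `w` are assumptions for illustration, not the paper's exact formulation.

```python
def weak_to_strong_guidance(strong_out, weak_out, w):
    # Refine step (sketch): push the strong model's conditional output
    # away from the weak reflection model's output, by analogy with
    # CFG's extrapolation away from the unconditional prediction.
    # w = 0 recovers the strong model's output unchanged.
    return strong_out + w * (strong_out - weak_out)
```

In practice both outputs would be tensors at the same denoising step (or logits for an ARM), and the guided result is fed back into the sampler in place of the plain conditional output.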