Cost-Aware Routing for Efficient Text-To-Image Generation

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the trade-off between high-fidelity generation and excessive computational cost in diffusion-based text-to-image synthesis, this paper proposes a prompt-aware dynamic multi-model routing framework. Methodologically, it first quantifies prompt complexity and then employs a learned routing policy to dynamically dispatch prompts of varying complexity to the most suitable generative model—such as diffusion models with differing step counts or lightweight distilled variants—enabling on-demand allocation of computational resources. Its key innovation lies in being the first framework to support adaptive, coordinated scheduling of nine heterogeneous pre-trained models within a unified architecture, thereby transcending rigid budget-constrained paradigms. Experiments on COCO and DiffusionDB demonstrate that the proposed method achieves higher average generation quality than all baselines while significantly reducing overall FLOPs: it preserves high fidelity for complex prompts and accelerates inference for simple prompts by over 3×.

📝 Abstract
Diffusion models are well known for their ability to generate a high-fidelity image for an input prompt through an iterative denoising process. Unfortunately, the high fidelity also comes at a high computational cost due to the inherently sequential generative process. In this work, we seek to optimally balance quality and computational cost, and propose a framework to allow the amount of computation to vary for each prompt, depending on its complexity. Each prompt is automatically routed to the most appropriate text-to-image generation function, which may correspond to a distinct number of denoising steps of a diffusion model, or a disparate, independent text-to-image model. Unlike uniform cost reduction techniques (e.g., distillation, model quantization), our approach achieves the optimal trade-off by learning to reserve expensive choices (e.g., 100+ denoising steps) only for a few complex prompts, and employ more economical choices (e.g., small distilled model) for less sophisticated prompts. We empirically demonstrate on COCO and DiffusionDB that by learning to route to nine already-trained text-to-image models, our approach is able to deliver an average quality that is higher than that achievable by any of these models alone.
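The cost-aware routing idea described in the abstract can be sketched as a selection over a pool of generators of increasing cost and quality. This is a minimal illustrative sketch, not the paper's learned policy: the word-count complexity heuristic, the candidate model names, and the `capacity`/`flops` scores are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    flops: float      # relative compute cost (assumed values)
    capacity: float   # rough quality ceiling in [0, 1] (assumed proxy)

# Hypothetical pool of pre-trained generators, from a small distilled
# model up to a full diffusion model with many denoising steps.
CANDIDATES = [
    Candidate("distilled-1step",   flops=1.0,  capacity=0.50),
    Candidate("diffusion-25step",  flops=10.0, capacity=0.80),
    Candidate("diffusion-100step", flops=40.0, capacity=0.95),
]

def prompt_complexity(prompt: str) -> float:
    """Toy proxy: longer, more detailed prompts count as more complex."""
    return min(1.0, len(prompt.split()) / 30.0)

def route(prompt: str) -> Candidate:
    """Pick the cheapest candidate whose capacity covers the prompt's
    estimated complexity; fall back to the strongest model otherwise."""
    need = prompt_complexity(prompt)
    viable = [c for c in CANDIDATES if c.capacity >= need]
    if viable:
        return min(viable, key=lambda c: c.flops)
    return max(CANDIDATES, key=lambda c: c.capacity)

print(route("a cat").name)  # simple prompt → "distilled-1step"
```

In the paper, the router is learned rather than hand-crafted, so that the complexity estimate and model choice are optimized jointly against quality and FLOPs; the fallback here merely mimics reserving the most expensive model for the hardest prompts.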
Problem

Research questions and friction points this paper is trying to address.

Balancing quality and computational cost in text-to-image generation
Automatically routing prompts to optimal text-to-image models
Reducing computational expense for simpler prompts while maintaining quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic routing based on prompt complexity
Utilizes multiple pre-trained text-to-image models
Optimizes computation-cost and quality trade-off
Qinchan Li
Tandon School of Engineering, New York University
Kenneth Chen
New York University
Computer Graphics, Vision Science, Virtual Reality, Computational Displays, Applied Perception
Changyue Su
Tandon School of Engineering, New York University
Wittawat Jitkrittum
Google DeepMind
LLM inference efficiency, model criticism, diffusion, RL
Qi Sun
Tandon School of Engineering, New York University
Patsorn Sangkloy
New York University
Computer Vision, Deep Learning