🤖 AI Summary
Diffusion models suffer from high inference latency and limited throughput, hindering their deployment in production-scale text-to-image (T2I) services.
Method: This paper proposes Argus, a prompt-granularity adaptive, high-throughput text-to-image (T2I) inference service system. It introduces a quality-throughput joint optimization framework that dynamically selects models, applies quality-aware approximate execution, accelerates iterative denoising, and performs resource-aware load balancing, enabling prompt-specific approximation strategies under fixed cluster capacity. The system calibrates its decision model on real-world workloads and switches intelligently among multiple acceleration techniques.
Contribution/Results: Experiments on two real-world workload traces demonstrate that Argus achieves 10× fewer latency SLO violations, 10% higher average generation quality, and 40% higher system throughput than baseline approaches.
📝 Abstract
Text-to-image (T2I) models have gained significant popularity. Most of these are diffusion models with unique computational characteristics, distinct from both traditional small-scale ML models and large language models. They are highly compute-bound and use an iterative denoising process to generate images, leading to very high inference time. This creates significant challenges in designing a high-throughput system. We discovered that a large fraction of prompts can be served using faster, approximated models. However, the approximation setting must be carefully calibrated for each prompt to avoid quality degradation. Designing a high-throughput system that assigns each prompt to the appropriate model and compatible approximation setting remains a challenging problem. We present Argus, a high-throughput T2I inference system that selects the right level of approximation for each prompt to maintain quality while meeting throughput targets on a fixed-size cluster. Argus intelligently switches between different approximation strategies to satisfy both throughput and quality requirements. Overall, Argus achieves 10x fewer latency service-level objective (SLO) violations, 10% higher average quality, and 40% higher throughput compared to baselines on two real-world workload traces.
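The core idea, selecting the cheapest per-prompt approximation setting that still preserves quality, can be illustrated with a minimal sketch. All names, candidate settings, and the quality predictor below are hypothetical placeholders, not the paper's actual models or calibration:

```python
from dataclasses import dataclass

@dataclass
class Setting:
    model: str    # e.g. a distilled vs. full diffusion model (illustrative)
    steps: int    # number of iterative denoising steps
    cost: float   # relative GPU time per image

# Candidate settings ordered cheapest-first (illustrative numbers).
SETTINGS = [
    Setting("distilled", 8, 0.2),
    Setting("distilled", 20, 0.5),
    Setting("full", 30, 1.0),
    Setting("full", 50, 1.6),
]

def predicted_quality(prompt: str, s: Setting) -> float:
    """Stand-in for a calibrated quality predictor: assume longer,
    more detailed prompts need more steps or the full model."""
    difficulty = min(len(prompt.split()) / 30.0, 1.0)
    capacity = 0.5 + 0.5 * min(s.cost, 1.0)
    return min(capacity / max(difficulty, 1e-6), 1.0)

def select_setting(prompt: str, quality_floor: float = 0.9) -> Setting:
    """Pick the cheapest setting whose predicted quality clears the
    floor; fall back to the most expensive setting otherwise."""
    for s in SETTINGS:
        if predicted_quality(prompt, s) >= quality_floor:
            return s
    return SETTINGS[-1]
```

Under this toy predictor, a short prompt is routed to the cheap distilled setting while a long, detailed prompt escalates to the full model, which is the kind of per-prompt routing the abstract describes.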