🤖 AI Summary
This work addresses the challenge of achieving both high throughput and low latency in diffusion model serving. Existing continuous batching approaches suffer from resource contention when UNet and VAE execute concurrently, which causes severe latency spikes. The authors propose SynerDiff, a continuous batching system built on synergy between the intra-concurrency and inter-concurrency levels. At the intra-concurrency level, VAE Chunking and Adaptive Skip-CFG relieve component-specific resource bottlenecks. At the inter-concurrency level, a threshold-aware scheduler balances UNet throughput against VAE latency, and a feedback controller adjusts the scheduling threshold dynamically based on queue load. By combining component-level resource optimization with the components' differing sensitivity to scheduling granularity, the method achieves 1.6× higher throughput and reduces average and P99 end-to-end latency by up to 78.7%, while preserving image fidelity.
📝 Abstract
The expansion of Artificial Intelligence-generated content services requires diffusion model serving to achieve high throughput and low end-to-end (E2E) task latency simultaneously. However, existing continuous batching methods suffer from severe resource contention when UNet and VAE run concurrently, which leads to latency spikes. Furthermore, concurrent multi-task scheduling entails a trade-off between UNet throughput and VAE latency that varies with the scheduling strategy. To address these issues, we propose SynerDiff, an efficient continuous batching system built on intra-inter level synergy. At the intra-concurrency level, SynerDiff alleviates resource contention by pruning component-specific resource bottlenecks via VAE Chunking and Adaptive Skip-CFG. At the inter-concurrency level, leveraging the components' differing sensitivity to scheduling granularity, a threshold-aware scheduler plans concurrent sequences and tunes intra-concurrency decisions to minimize VAE latency while keeping UNet within a high-throughput threshold. Additionally, a feedback controller dynamically adjusts this threshold based on queue load to raise the system's capacity ceiling. Experimental results show that SynerDiff improves throughput by 1.6× and decreases both average E2E latency and P99 tail latency by up to 78.7% compared to benchmarks, while guaranteeing high image fidelity.
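
The abstract does not give the scheduler's or controller's actual rules, so the following is only a minimal sketch of the general idea of threshold-aware selection with a queue-driven feedback loop, not SynerDiff's implementation. Every name here (`Plan`, `pick_plan`, `ThresholdController`), the candidate plans and their numbers, the step size, and the direction of adjustment (raise the threshold when the queue is long, lower it when the queue is short) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A hypothetical concurrent execution plan with estimated costs."""
    name: str
    unet_throughput: float  # fraction of UNet-alone throughput, in [0, 1]
    vae_latency_ms: float   # estimated VAE decode latency under this plan


def pick_plan(plans: list[Plan], threshold: float) -> Plan:
    """Pick the lowest-VAE-latency plan that keeps UNet above the threshold.

    If no plan meets the threshold, fall back to the plan with the
    highest UNet throughput (an assumed fallback policy).
    """
    feasible = [p for p in plans if p.unet_throughput >= threshold]
    if not feasible:
        return max(plans, key=lambda p: p.unet_throughput)
    return min(feasible, key=lambda p: p.vae_latency_ms)


@dataclass
class ThresholdController:
    """Step controller that moves the threshold with queue load (assumed rule)."""
    threshold: float = 0.80
    low: float = 0.70
    high: float = 0.95
    step: float = 0.05
    target_queue: int = 8

    def update(self, queue_len: int) -> float:
        if queue_len > self.target_queue:
            # Long queue: favor UNet throughput to drain pending work.
            self.threshold = min(self.high, self.threshold + self.step)
        elif queue_len < self.target_queue:
            # Short queue: relax the threshold so VAE latency can be lower.
            self.threshold = max(self.low, self.threshold - self.step)
        return self.threshold


# Made-up candidate plans: finer VAE chunking is assumed to disturb
# UNet less but to stretch out the VAE decode.
PLANS = [
    Plan("vae_whole", unet_throughput=0.70, vae_latency_ms=120.0),
    Plan("vae_chunk_x2", unet_throughput=0.88, vae_latency_ms=150.0),
    Plan("vae_chunk_x4", unet_throughput=0.95, vae_latency_ms=210.0),
]

if __name__ == "__main__":
    controller = ThresholdController()
    for queue_len in (2, 2, 12, 12, 12, 12):
        threshold = controller.update(queue_len)
        plan = pick_plan(PLANS, threshold)
        print(f"queue={queue_len:2d} threshold={threshold:.2f} plan={plan.name}")
```

In this toy setup, a short queue lets the scheduler choose plans with lower VAE latency, while a growing queue pushes the threshold up and moves the choice toward plans that protect UNet throughput. A real system would derive the plan estimates from profiling rather than from fixed constants.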