Cronus: Efficient LLM inference on Heterogeneous GPU Clusters via Partially Disaggregated Prefill

📅 2025-09-22

📈 Citations: 0

✨ Influential: 0

career value

223K/year

🤖 AI Summary

To address low throughput and high latency in LLM inference on heterogeneous GPU clusters—caused by workload imbalance between the prefill and decoding stages—this paper proposes a dynamic workload allocation framework. Our core innovation is a *partially decoupled prefill architecture*, which enables fine-grained load balancing across high-end and low-end GPUs via dynamic prefill splitting, compute-pipeline overlap, and cross-device collaborative scheduling. Unlike conventional request-level decoupling, our approach decouples *prefill computation only*, eliminating resource underutilization and excessive inter-device communication overhead. We further integrate data and pipeline parallelism optimizations. Experiments across diverse heterogeneous GPU configurations (e.g., A100+L4, V100+T4) demonstrate that our method achieves up to 2.3× higher throughput than baseline approaches, while reducing P99 time-to-first-token (TTFT) and P99 token-batch time (TBT) by 41% and 37%, respectively.

Technology Category

Application Category

📝 Abstract

Efficient LLM inference is critical for real-world applications, especially within heterogeneous GPU clusters commonly found in organizations and on-premise datacenters as GPU architecture rapidly evolves. Current disaggregated prefill strategies, which separate the prefill and decode stages of LLM inference across different GPUs, often suffer from suboptimal performance due to imbalances between GPU capabilities and workload demands. On the other hand, extending conventional data parallelism and pipeline parallelism to heterogeneous setups incurs high inference latencies. To address these challenges, we introduce Cronus, a novel LLM inference system designed to dynamically balance workloads across heterogeneous GPUs using partially disaggregated prefill. Cronus partitions each prefill stage and executes its initial portion on the low-end GPU, while overlapping the remaining prefill and decode stages of earlier requests on the high-end GPU. Extensive evaluations across various high-end and low-end GPU combinations demonstrate that Cronus significantly improves the throughput over disaggregated prefill. It also reduces TTFT P99 and TBT P99 significantly over DP and PP while maintaining similar or better throughput.

Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM inference in heterogeneous GPU clusters

Addressing workload imbalance in disaggregated prefill strategies

Reducing latency in data and pipeline parallelism setups

Innovation

Methods, ideas, or system contributions that make the work stand out.

Partially disaggregated prefill for workload balance

Dynamic partitioning between low-end and high-end GPUs

Overlapping prefill and decode stages to reduce latency

🔎 Similar Papers

DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models