Equinox: Holistic Fair Scheduling in Serving Large Language Models

📅 2025-08-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
In LLM serving, balancing user fairness (measured by weighted latency and token allocation) against system efficiency (measured by throughput and GPU utilization) remains challenging; moreover, these key metrics are only observable after execution, creating a scheduling paradox. Method: the paper proposes a dual-counter fair scheduling framework featuring (i) a tunable, unified fairness score that integrates user-perceived latency and resource utilization; (ii) MoPE, a deterministic mixture-of-experts predictor that estimates first-token latency and throughput before scheduling; and (iii) adaptive batching coupled with non-blocking scheduling. Results: on real and synthetic workloads, the framework achieves up to 1.3× higher throughput, 60% lower first-token latency, 13% higher fairness, and 94% GPU utilization compared to VTC (the Virtual Token Counter baseline), demonstrating robustness across heterogeneous platforms.

📝 Abstract
We address the limitations of current LLM serving with a dual-counter framework separating user and operator perspectives. The User Fairness Counter measures quality of service via weighted tokens and latency; the Resource Fairness Counter measures operational efficiency through throughput and GPU utilization. Since these metrics are only available post-execution, creating a scheduling paradox, we introduce a deterministic Mixture of Prediction Experts (MoPE) framework to predict user-perceived latency, output tokens, throughput, and GPU utilization. These predictions enable calculation of a unified Holistic Fairness score that balances both counters through tunable parameters for proactive fairness-aware scheduling. We implement this in Equinox, an open-source system with further optimizations such as adaptive batching and stall-free scheduling. Evaluations on production traces (ShareGPT, LMSYS) and synthetic workloads demonstrate Equinox achieves up to 1.3× higher throughput, 60% lower time-to-first-token latency, and 13% higher fairness versus VTC while maintaining 94% GPU utilization, proving fairness under bounded discrepancy across heterogeneous platforms.
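The abstract describes a Holistic Fairness score that blends the two counters through tunable parameters, using MoPE's pre-execution predictions. The paper's exact formulas are not reproduced on this page; the sketch below is a hypothetical illustration only, assuming a simple convex combination with a weight `alpha`, with all names and score terms invented for clarity.

```python
# Hypothetical sketch of a dual-counter "holistic fairness" score.
# NOT the paper's actual formulation: counter contents, the score terms,
# and the convex-combination form are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class UserFairnessCounter:
    """User-side counter: quality of service per user."""
    weighted_tokens: float   # tokens served, weighted per user
    latency_penalty: float   # penalty derived from user-perceived latency


@dataclass
class ResourceFairnessCounter:
    """Operator-side counter: efficiency of the serving system."""
    throughput: float        # predicted tokens/sec (e.g. from a MoPE-style predictor)
    gpu_utilization: float   # predicted busy fraction of the GPU, in [0, 1]


def holistic_fairness(user: UserFairnessCounter,
                      resource: ResourceFairnessCounter,
                      alpha: float = 0.5) -> float:
    """Blend user- and resource-side scores with a tunable alpha in [0, 1].

    alpha -> 1 prioritizes user fairness; alpha -> 0 prioritizes efficiency.
    """
    user_score = user.weighted_tokens - user.latency_penalty
    resource_score = resource.throughput * resource.gpu_utilization
    return alpha * user_score + (1.0 - alpha) * resource_score


# A proactive scheduler would score pending requests with *predicted*
# metrics (resolving the post-execution paradox) and serve the best one.
score = holistic_fairness(UserFairnessCounter(10.0, 2.0),
                          ResourceFairnessCounter(100.0, 0.9),
                          alpha=0.5)
print(score)  # 0.5 * (10 - 2) + 0.5 * (100 * 0.9) = 49.0
```

Because the score is computed from predictions rather than observed outcomes, scheduling decisions can be made before execution, which is the key property the abstract's "proactive fairness-aware scheduling" requires.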
Problem

Research questions and friction points this paper is trying to address.

Addressing LLM serving limitations with dual-counter fairness framework
Predicting post-execution metrics via Mixture of Prediction Experts
Achieving holistic fairness through tunable proactive scheduling parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-counter framework for fairness
Mixture of Prediction Experts forecasting
Holistic Fairness score balancing metrics
👥 Authors

- Zhixiang Wei — Shanghai Jiao Tong University
- James Yen — Shanghai Jiao Tong University
- Jingyi Chen — Shanghai Jiao Tong University
- Ziyang Zhang — Shanghai Jiao Tong University
- Zhibai Huang — Shanghai Jiao Tong University
- Chen Chen — Shanghai Jiao Tong University
- Xingzi Yu — Shanghai Jiao Tong University
- Yicheng Gu — Aalto University (Speech and Singing Voice Synthesis, Audio-Visual Generation, Digital Audio Effects)
- Chenggang Wu — Shanghai Jiao Tong University
- Yun Wang — Shanghai Jiao Tong University
- Mingyuan Xia — McGill University (Mobile Systems and Applications, Program Analysis, Storage Systems, Virtualization)
- Jie Wu — Cloud Computing Research Institute, China Telecom
- Hao Wang — Stevens Institute of Technology
- Zhengwei Qi — Professor of Computer Science, Shanghai Jiao Tong University (system software, program analysis, cloud computing)