Dispatching Odyssey: Exploring Performance in Computing Clusters under Real-world Workloads

📅 2025-04-14

🤖 AI Summary

This paper investigates how dispatching policies affect job response time under realistic datacenter workloads. Using workload traces measured in Google data centers, the authors run data-driven simulations, compare them against simple G/G queue approximations, and analyze performance at both the job and task level for JIQ, LWL, and RR as the number of servers varies under a fixed computational power budget. Key findings include: (1) at the task level, the non-size-based JIQ appears to outperform the size-based LWL; (2) at the job level, JIQ and LWL exhibit an optimal server count for a fixed utilization coefficient, whereas RR performance worsens monotonically; and (3) preliminary results on a two-stage scheduler that partitions tasks by service threshold indicate that modest architectural changes can further improve response time. The study offers insights into these results and suggests directions for fully explaining them.

📝 Abstract

Recent workload measurements in Google data centers provide an opportunity to challenge existing models and, more broadly, to enhance the understanding of dispatching policies in computing clusters. Through extensive data-driven simulations, we aim to highlight the key features of workload traffic traces that influence response time performance under simple yet representative dispatching policies. For a given computational power budget, we vary the cluster size, i.e., the number of available servers. A job-level analysis reveals that Join Idle Queue (JIQ) and Least Work Left (LWL) exhibit an optimal working point for a fixed utilization coefficient as the number of servers is varied, whereas Round Robin (RR) demonstrates monotonically worsening performance. Additionally, we explore the accuracy of simple G/G queue approximations. When decomposing jobs into tasks, interesting results emerge; notably, the simpler, non-size-based policy JIQ appears to outperform the more "powerful" size-based LWL policy. Complementing these findings, we present preliminary results on a two-stage scheduling approach that partitions tasks based on service thresholds, illustrating that modest architectural modifications can further enhance performance under realistic workload conditions. We provide insights into these results and suggest promising directions for fully explaining the observed phenomena.
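
The abstract does not say which G/G approximation is evaluated. As an illustration only, one classical candidate is Kingman's approximation for the mean response time of a single G/G/1 queue. The symbols are assumptions introduced here: \rho is the server utilization, c_a and c_s are the coefficients of variation of the interarrival and service times, and E[S] is the mean service time.

```latex
% Kingman's approximation for a G/G/1 queue (illustrative; not necessarily the paper's choice)
E[T] \;\approx\; E[S] \;+\; \frac{\rho}{1-\rho}\cdot\frac{c_a^{2}+c_s^{2}}{2}\cdot E[S],
\qquad 0 \le \rho < 1
```

The second term is the approximate mean waiting time. It grows with the variability of both arrivals and service times, which is one reason heavy-tailed workloads can make simple approximations hard to get right.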

Problem

Research questions and friction points this paper is trying to address.

Evaluating dispatching policies in computing clusters under real workloads

Analyzing performance impact of cluster size and workload features

Exploring task partitioning strategies to enhance scheduling performance
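
The policy comparison described above can be sketched with a small simulator. This is a minimal sketch and not the paper's framework: it assumes FCFS servers, immediate dispatch on arrival, equal server speeds of `budget / n_servers`, and a JIQ fallback to a random server when none is idle. The function name `simulate`, its parameters, and the synthetic Pareto workload in the demo are all hypothetical; the paper uses Google traces.

```python
import random

def simulate(arrivals, sizes, n_servers, policy, budget=1.0, seed=0):
    """Mean response time with FCFS servers and immediate dispatch.

    Assumption: the total computational budget is split evenly, so each
    server runs at speed budget / n_servers.
    """
    rng = random.Random(seed)
    speed = budget / n_servers
    free_at = [0.0] * n_servers  # time at which each server drains its queue
    next_rr = 0
    total = 0.0
    for t, size in zip(arrivals, sizes):
        if policy == "RR":
            i = next_rr
            next_rr = (next_rr + 1) % n_servers
        elif policy == "JIQ":
            idle = [k for k in range(n_servers) if free_at[k] <= t]
            # Assumed fallback: pick a random server when none is idle.
            i = rng.choice(idle) if idle else rng.randrange(n_servers)
        elif policy == "LWL":
            # Work left, in time units, is the remaining backlog of a server.
            i = min(range(n_servers), key=lambda k: max(0.0, free_at[k] - t))
        else:
            raise ValueError(f"unknown policy: {policy}")
        start = max(t, free_at[i])
        free_at[i] = start + size / speed
        total += free_at[i] - t  # response time = completion - arrival
    return total / len(sizes)

if __name__ == "__main__":
    # Synthetic heavy-tailed workload, for illustration only.
    rng = random.Random(1)
    n_jobs, rho = 5000, 0.7
    sizes = [rng.paretovariate(1.5) for _ in range(n_jobs)]  # mean size 3
    lam = rho / 3.0  # arrival rate giving utilization rho at budget 1.0
    t, arrivals = 0.0, []
    for _ in range(n_jobs):
        t += rng.expovariate(lam)
        arrivals.append(t)
    for n in (2, 8, 32):
        for policy in ("RR", "JIQ", "LWL"):
            print(n, policy, round(simulate(arrivals, sizes, n, policy), 2))
```

Sweeping `n_servers` at a fixed `budget` mirrors the experiment design in the abstract: more servers means more parallelism but slower individual servers, which is the trade-off behind an optimal working point.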

Innovation

Methods, ideas, or system contributions that make the work stand out.

Data-driven simulations analyze workload traffic traces

JIQ, LWL, and RR compared under varying server counts

Two-stage scheduling with service thresholds shows preliminary performance gains
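
The source gives no details of the two-stage scheduler beyond partitioning tasks by a service threshold. The sketch below is one hypothetical reading: tasks whose (known or estimated) service time is at most `threshold` go to a pool of "short" servers and the rest go to a pool of "long" servers, with round robin inside each pool. The function name, the two-pool layout, and the use of round robin in the second stage are assumptions, not the paper's design.

```python
def two_stage_dispatch(sizes, threshold, n_short, n_long):
    """Return a server index for each task.

    Stage 1: classify each task as short (size <= threshold) or long.
    Stage 2: round robin inside the pool chosen in stage 1.
    Servers 0..n_short-1 form the short pool; the remaining n_long
    servers form the long pool. (Hypothetical layout.)
    """
    next_short, next_long = 0, 0
    assignment = []
    for size in sizes:
        if size <= threshold:
            assignment.append(next_short)
            next_short = (next_short + 1) % n_short
        else:
            assignment.append(n_short + next_long)
            next_long = (next_long + 1) % n_long
    return assignment
```

Separating short from long tasks keeps small tasks from queueing behind very large ones, which is the usual motivation for threshold-based partitioning under heavy-tailed service times.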

🔎 Similar Papers

How to Evaluate Distributed Coordination Systems? -- A Survey and Analysis