Balancing Fairness and Performance in Multi-User Spark Workloads with Dynamic Scheduling (extended version)

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of simultaneously ensuring fairness and minimizing response time in multi-tenant, long-running Spark clusters, this paper proposes a dynamic scheduling framework. The method introduces three key innovations: (1) User-Weighted Fair Queuing (UWFQ), which enforces fairness at the user level—not the job level—by allocating resources proportionally to user weights; (2) a runtime partitioning mechanism grounded in a virtual fair queue model, integrating job completion time estimation and dynamic task granularity adjustment to mitigate data skew and priority inversion; and (3) fine-grained, adaptive scheduling decisions driven by estimated job finish times under a bounded fairness model. Experimental evaluation demonstrates that, compared to Spark's native scheduler and state-of-the-art fair schedulers, the framework reduces average response time for short jobs by up to 74%, while significantly improving overall system performance under stringent fairness guarantees.

📝 Abstract
Apache Spark is a widely adopted framework for large-scale data processing. However, in industrial analytics environments, Spark's built-in schedulers, such as FIFO and fair scheduling, struggle to maintain both user-level fairness and low mean response time, particularly in long-running shared applications. Existing solutions typically focus on job-level fairness, which unintentionally favors users who submit more jobs. Although Spark offers a built-in fair scheduler, it lacks adaptability to dynamic user workloads and may degrade overall job performance. We present the User-Weighted Fair Queuing (UWFQ) scheduler, designed to minimize job response times while ensuring equitable resource distribution across users and their respective jobs. UWFQ simulates a virtual fair queuing system and schedules jobs based on their estimated finish times under a bounded fairness model. To further address task skew and reduce priority inversions, which are common in Spark workloads, we introduce runtime partitioning, a method that dynamically refines task granularity based on expected runtime. We implement UWFQ within the Spark framework and evaluate its performance using multi-user synthetic workloads and Google cluster traces. We show that UWFQ reduces the average response time of small jobs by up to 74% compared to existing built-in Spark schedulers and to state-of-the-art fair scheduling algorithms.
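The abstract describes UWFQ as simulating a virtual fair queuing system and dispatching jobs by estimated finish time, with fairness enforced per user rather than per job. The paper's exact model is not reproduced here, but the classic weighted fair queuing recipe it builds on can be sketched as follows: each submitted job receives a virtual finish time equal to its start tag plus its estimated runtime scaled down by the user's weight, and the scheduler always dispatches the job with the smallest virtual finish time. All names (`UWFQ`, `Job`, the weight values) are illustrative assumptions, not the authors' implementation.

```python
import heapq
from dataclasses import dataclass

@dataclass
class Job:
    user: str
    est_runtime: float    # estimated job runtime (seconds)
    vfinish: float = 0.0  # virtual finish time assigned at submission

class UWFQ:
    """Hypothetical sketch of user-weighted fair queuing order."""

    def __init__(self, user_weights):
        self.weights = dict(user_weights)                 # user -> weight
        self.last_vfinish = {u: 0.0 for u in self.weights}
        self.vtime = 0.0                                  # global virtual clock
        self.queue = []                                   # min-heap on vfinish

    def submit(self, job):
        # Start tag: the later of the virtual clock and the user's last
        # finish tag, so a backlogged user receives service in proportion
        # to their weight instead of their submission rate.
        start = max(self.vtime, self.last_vfinish[job.user])
        job.vfinish = start + job.est_runtime / self.weights[job.user]
        self.last_vfinish[job.user] = job.vfinish
        heapq.heappush(self.queue, (job.vfinish, id(job), job))

    def next_job(self):
        # Dispatch the job that would finish earliest in the
        # simulated fair-queuing system.
        if not self.queue:
            return None
        vf, _, job = heapq.heappop(self.queue)
        self.vtime = max(self.vtime, vf)
        return job
```

With equal estimated runtimes, a user with weight 2.0 has their job dispatched before a weight-1.0 user's job, since the higher weight shrinks the virtual finish time; this is the sense in which fairness is applied at the user level rather than the job level.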
Problem

Research questions and friction points this paper is trying to address.

Balancing user-level fairness with low job response times in Spark
Addressing dynamic workload adaptability in fair scheduling algorithms
Mitigating task skew and priority inversion in multi-user environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic fair queuing scheduler for user-level fairness
Runtime partitioning to reduce task skew issues
Bounded fairness model minimizing job response times
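The runtime partitioning idea above — refining task granularity based on expected runtime so that skewed, oversized tasks cannot monopolize executor slots and delay short jobs — can be illustrated with a minimal sketch. The cost model (a fixed per-record processing rate) and the target granularity are assumptions for illustration only; the paper's actual estimator is not reproduced here.

```python
def repartition_by_runtime(partition_sizes, rate, target_runtime):
    """Split partitions so that each chunk's estimated runtime
    (records / processing rate) stays within target_runtime.

    partition_sizes: record counts per input partition
    rate: assumed records processed per second (illustrative cost model)
    target_runtime: desired upper bound on a single task's runtime (s)
    """
    max_records = max(1, int(rate * target_runtime))
    chunks = []
    for size in partition_sizes:
        # Carve oversized (skewed) partitions into bounded chunks.
        while size > max_records:
            chunks.append(max_records)
            size -= max_records
        if size > 0:
            chunks.append(size)
    return chunks
```

For example, with a rate of 100 records/s and a 2-second target, a skewed 1000-record partition is split into five 200-record chunks while a 100-record partition passes through untouched, so a short job's tasks can interleave instead of waiting behind one long task (the priority-inversion scenario the paper targets).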