FuxiShuffle: An Adaptive and Resilient Shuffle Service for Distributed Data Processing on Alibaba Cloud

📅 2026-02-25
🤖 AI Summary
This work addresses the performance bottleneck of the shuffle phase in distributed data processing, which stems from fine-grained random I/O and network contention. Existing systems make the problem worse because they cannot adapt to dynamic job characteristics and resource fluctuations, and because their fault tolerance mechanisms are passive and inefficient. To overcome these limitations, the authors design and implement FuxiShuffle, a general-purpose shuffle service for the hyperscale production environment of MaxCompute. The design combines dynamic shuffle mode selection, progress-aware scheduling of downstream workers, automatic selection of a backup strategy for each shuffle data chunk, and proactive fault tolerance built on multi-replica failover, careful memory management, and incremental recovery. Experiments show that FuxiShuffle significantly reduces both end-to-end job completion time and aggregate resource consumption compared to baseline systems, and micro experiments suggest that the individual designs improve adaptability and failure resilience.

📝 Abstract
Shuffle exchanges intermediate results between upstream and downstream operators in distributed data processing and is usually the bottleneck due to factors such as small random I/Os and network contention. Several systems have been designed to improve shuffle efficiency, but from our experience running ultra-large clusters on the Alibaba Cloud MaxCompute platform, we observe that they cannot adapt to highly dynamic job characteristics and cluster resource conditions, and that their fault tolerance mechanisms are passive and inefficient even though failures are inevitable. To address these limitations, we design and implement FuxiShuffle as a general data shuffle service for the ultra-large production environment of MaxCompute, featuring good adaptability and efficient failure resilience. Specifically, to achieve good adaptability, FuxiShuffle dynamically selects the shuffle mode based on runtime information, conducts progress-aware scheduling for the downstream workers, and automatically determines the most suitable backup strategy for each shuffle data chunk. To make failure resilience efficient, FuxiShuffle actively ensures data availability with multi-replica failover, prevents memory overflow with careful memory management, and employs an incremental recovery mechanism that does not lose computation progress. Our experiments show that, compared to baseline systems, FuxiShuffle significantly reduces not only end-to-end job completion time but also aggregate resource consumption. Micro experiments suggest that our designs are effective in improving adaptability and failure resilience.
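
To make the "small random I/Os" point concrete, the following is a minimal sketch of a generic hash-partitioned shuffle. It is not FuxiShuffle's implementation, which the abstract does not describe at this level. The names `partition_id`, `shuffle`, and `mapper_outputs` are hypothetical and chosen only for illustration. The sketch shows that M upstream tasks and R downstream partitions can produce up to M × R separate blocks, which is why each block tends to be small.

```python
from collections import defaultdict
import zlib


def partition_id(key: str, num_reducers: int) -> int:
    """Deterministic hash partitioning (crc32 is stable across runs, unlike hash())."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    """Route every (key, value) record from each upstream task to a downstream partition.

    Returns (reducer_inputs, num_blocks), where num_blocks counts the distinct
    (mapper, reducer) blocks that would have to be written and later fetched.
    """
    blocks = defaultdict(list)  # (mapper_index, reducer_index) -> records
    for m, records in enumerate(mapper_outputs):
        for key, value in records:
            blocks[(m, partition_id(key, num_reducers))].append((key, value))

    # Each downstream partition gathers its blocks from every upstream task.
    reducer_inputs = defaultdict(list)
    for (m, r), records in sorted(blocks.items()):
        reducer_inputs[r].extend(records)
    return dict(reducer_inputs), len(blocks)


# Hypothetical toy input: three upstream tasks, two downstream partitions.
mapper_outputs = [
    [("a", 1), ("b", 2), ("c", 3)],
    [("a", 4), ("d", 5)],
    [("b", 6), ("c", 7), ("d", 8)],
]
reducer_inputs, num_blocks = shuffle(mapper_outputs, num_reducers=2)
print(f"{num_blocks} blocks for {len(mapper_outputs)} mappers x 2 reducers")
```

With thousands of tasks on each side, the block count grows multiplicatively while the data per block shrinks, which is the generic pattern behind the fine-grained I/O and network contention the abstract refers to.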
Problem

Research questions and friction points this paper is trying to address.

shuffle
distributed data processing
adaptability
fault tolerance
resource contention
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive shuffle
failure resilience
progress-aware scheduling
multi-replica failover
incremental recovery
Authors

Yuhao Lin
Wuhan University

Zhipeng Tang
UMass Amherst

Jiayan Tong
Alibaba Cloud, Alibaba Group

Junqing Xiao
Alibaba Cloud, Alibaba Group

Bin Lu
Shanghai Jiao Tong University
graph neural network, spatiotemporal data mining, AI for Science, GeoAI

Yuhang Li
Yale University
Machine Learning

Chao Li
Alibaba Cloud, Alibaba Group

Zhiguo Zhang
Alibaba Cloud, Alibaba Group

Junhua Wang
College of Computer Science, Nanjing University of Aeronautics and Astronautics
Vehicular network, Edge computing, SDN in 5G

Hao Luo
Wuhan University

James Cheng
The Chinese University of Hong Kong

Chuang Hu
Wuhan University

Jiawei Jiang
Wuhan University
Machine Learning System, Federated Learning, Graph Learning

Xiao Yan
Wuhan University
Systems for Data Processing