D-Rex: Heterogeneity-Aware Reliability Framework and Adaptive Algorithms for Distributed Storage

📅 2025-05-29
🤖 AI Summary
To address inefficient use of erasure coding in heterogeneous distributed storage, where nodes differ widely in capacity, I/O performance, and failure rates, this paper proposes D-Rex, a set of dynamic scheduling algorithms. The contributions are: (1) adaptive selection of erasure coding parameters and mapping of chunks to heterogeneous nodes; (2) two dynamic schedulers, D-Rex LB and D-Rex SC, where D-Rex SC gives robust storage utilization and throughput at a higher computational cost and D-Rex LB is faster but slightly less competitive; and (3) two greedy algorithms, GreedyMinStorage and GreedyLeastUsed, that optimize for storage utilization and load balancing, respectively. In the experimental evaluation, the dynamic schedulers store on average 45% more data items than state-of-the-art algorithms without significantly degrading I/O throughput, and GreedyLeastUsed stores 21% more data items while also increasing throughput.

📝 Abstract
The exponential growth of data necessitates distributed storage models, such as peer-to-peer systems and data federations. While distributed storage can reduce costs and increase reliability, the heterogeneity in storage capacity, I/O performance, and failure rates of storage resources makes their efficient use a challenge. Further, node failures are common and can lead to data unavailability and even data loss. Erasure coding is a common resiliency strategy implemented in storage systems to mitigate failures by striping data across storage locations. However, erasure coding is computationally expensive, and existing systems do not consider the heterogeneous resources and their varied capacity and performance when placing data chunks. We tackle the challenges of using erasure coding with distributed and heterogeneous nodes, aiming to store as much data as possible, minimize encoding and decoding time, and meet user-defined reliability requirements for each data item. We propose two new dynamic scheduling algorithms, D-Rex LB and D-Rex SC, that adaptively choose erasure coding parameters and map chunks to heterogeneous nodes. D-Rex SC achieves robust performance for both storage utilization and throughput, at a higher computational cost, while D-Rex LB is faster but with slightly less competitive performance. In addition, we propose two greedy algorithms, GreedyMinStorage and GreedyLeastUsed, that optimize for storage utilization and load balancing, respectively. Our experimental evaluation shows that our dynamic schedulers store, on average, 45% more data items without significantly degrading I/O throughput compared to state-of-the-art algorithms, while GreedyLeastUsed is able to store 21% more data items while also increasing throughput.
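
To make the reliability requirement concrete, here is a minimal sketch of how one might check whether a chunk placement meets a user-defined target when nodes have different failure probabilities. It is not taken from the paper: the function names, the assumption of independent node failures, and the rule "one chunk per listed node" are all illustrative assumptions. With m parity chunks, an item stays recoverable as long as at most m of the chosen nodes fail.

```python
from typing import Optional, Sequence


def survival_probability(fail_probs: Sequence[float], parity: int) -> float:
    """Probability that at most `parity` of the chosen nodes fail.

    Assumes node failures are independent (a simplifying assumption).
    """
    # dist[j] = probability that exactly j of the nodes seen so far have failed
    dist = [1.0]
    for p in fail_probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            nxt[j] += q * (1.0 - p)  # this node survives
            nxt[j + 1] += q * p      # this node fails
        dist = nxt
    return sum(dist[: parity + 1])


def min_parity(fail_probs: Sequence[float], target: float) -> Optional[int]:
    """Smallest parity count that meets `target` with one chunk per listed node.

    Returns None if even a single data chunk plus all-parity cannot meet it.
    """
    # At least one chunk must carry data, so parity ranges over 0..n-1.
    for parity in range(len(fail_probs)):
        if survival_probability(fail_probs, parity) >= target:
            return parity
    return None
```

For example, three nodes with failure probabilities 0.01, 0.2, and 0.05 and a target of 0.95 would need one parity chunk under these assumptions, since tolerating zero failures gives only about 0.75.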
Problem

Research questions and friction points this paper is trying to address.

Optimizing erasure coding for heterogeneous distributed storage nodes
Minimizing encoding/decoding time while meeting reliability requirements
Balancing storage utilization and throughput in dynamic scheduling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneity-aware adaptive erasure coding algorithms
Dynamic scheduling for optimal storage and throughput
Greedy algorithms for storage utilization and load balancing
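
As a rough illustration of what a load-balancing greedy placement could look like, the sketch below puts each chunk on the least-utilized nodes that still have room. The page does not describe how GreedyLeastUsed actually works, so this is only a generic least-used-first heuristic; the `Node` class, the `place_least_used` function, and the fixed `k` and `m` arguments are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    capacity: float  # total capacity, arbitrary units
    used: float = 0.0

    @property
    def free(self) -> float:
        return self.capacity - self.used

    @property
    def utilization(self) -> float:
        return self.used / self.capacity


def place_least_used(
    nodes: List[Node], item_size: float, k: int, m: int
) -> Optional[List[str]]:
    """Place k data + m parity chunks on the k+m least-utilized nodes with room.

    Returns the chosen node names, or None if too few nodes can hold a chunk.
    Hypothetical heuristic, not the paper's algorithm.
    """
    chunk = item_size / k  # every chunk (data or parity) has this size
    candidates = sorted(
        (n for n in nodes if n.free >= chunk), key=lambda n: n.utilization
    )
    if len(candidates) < k + m:
        return None
    chosen = candidates[: k + m]
    for n in chosen:
        n.used += chunk
    return [n.name for n in chosen]
```

In this sketch an item of size s consumes s * (k + m) / k of raw capacity, so a larger k lowers storage overhead but spreads the item over more nodes.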
Maxime Gonthier
University of Chicago, Argonne National Laboratory
Dante D. Sanchez-Gallegos
Universidad Carlos III de Madrid
Haochen Pan
University of Chicago
Distributed Systems, Cloud Computing
Bogdan Nicolae
Argonne National Laboratory
High Performance Computing, AI, Parallel and Distributed Systems, Storage, Resilience
Sicheng Zhou
Southern University of Science and Technology
H. Nguyen
University of Chicago, Argonne National Laboratory
Valérie Hayot-Sasson
University of Chicago, Argonne National Laboratory
J. G. Pauloski
University of Chicago
J. Carretero
Universidad Carlos III de Madrid
Kyle Chard
University of Chicago and Argonne National Laboratory
computer science, distributed systems, high performance computing, scientific computing
Ian Foster
University of Chicago, Argonne National Laboratory