🤖 AI Summary
Locally Repairable Codes (LRCs) used with wide-stripe erasure coding suffer from structural limitations: enlarged local groups increase single-node repair cost, multi-node failures frequently trigger expensive global repair, and reliability degrades sharply. Method: This paper proposes Cascaded Parity LRCs (CP-LRCs), a family of wide-stripe LRCs that establishes a structured dependency between local and global parity blocks. A global parity block is decomposed across all local parity blocks, forming a cascaded parity group that lets local and global parities cooperate during repair while preserving MDS-level fault tolerance. The design comprises a general coefficient-generation framework and repair algorithms that exploit cascading, and it is instantiated as two variants, CP-Azure and CP-Uniform. Contribution/Results: Evaluations on Alibaba Cloud show repair-time reductions of up to 41% for single-node failures and 26% for two-node failures.
📝 Abstract
Erasure coding with wide stripes is increasingly adopted to reduce storage overhead in large-scale storage systems. However, existing Locally Repairable Codes (LRCs) exhibit structural limitations in this setting: inflated local groups increase single-node repair cost, multi-node failures frequently trigger expensive global repair, and reliability degrades sharply. We identify a key root cause: local and global parity blocks are designed independently, preventing them from cooperating during repair. We present Cascaded Parity LRCs (CP-LRCs), a new family of wide-stripe LRCs that embed a structured dependency between parity blocks by decomposing a global parity block across all local parity blocks. This creates a cascaded parity group that preserves MDS-level fault tolerance while enabling low-bandwidth single-node and multi-node repairs. We provide a general coefficient-generation framework, develop repair algorithms exploiting cascading, and instantiate the design with CP-Azure and CP-Uniform. Evaluations on Alibaba Cloud show reductions in repair time of up to 41% for single-node failures and 26% for two-node failures.
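To make the idea of decomposing a global parity block across local parity blocks concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not the paper's construction: the six one-byte data symbols, the coefficients `1..6`, the two-group layout, and the helper names (`gf_mul`, `gf_inv`, `weighted_sum`, `repair_data`) are all hypothetical, and the sketch assumes the simplest possible cascade, in which the local parities sum (XOR) to one global parity. The coefficient-generation framework and the CP-Azure and CP-Uniform layouts are not described in the abstract and may differ.

```python
from functools import reduce

def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11d)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result

def gf_inv(a: int) -> int:
    """Brute-force multiplicative inverse in GF(2^8); fine for a toy."""
    for x in range(1, 256):
        if gf_mul(a, x) == 1:
            return x
    raise ZeroDivisionError("0 has no inverse")

def weighted_sum(indices, data, coeffs) -> int:
    """XOR of coeffs[i] * data[i] over indices (addition in GF(2^8) is XOR)."""
    return reduce(lambda acc, i: acc ^ gf_mul(coeffs[i], data[i]), indices, 0)

# Hypothetical stripe: 6 data symbols split into 2 local groups.
data = [0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC]
coeffs = [1, 2, 3, 4, 5, 6]          # assumed nonzero coefficients
groups = [[0, 1, 2], [3, 4, 5]]

# Each local parity holds its group's share of the global parity.
local = [weighted_sum(g, data, coeffs) for g in groups]
global_parity = weighted_sum(range(len(data)), data, coeffs)

def repair_data(lost: int, group, survivors, local_parity: int) -> int:
    """Recover one lost data symbol from its local group only."""
    partial = local_parity
    for i in group:
        if i != lost:
            partial ^= gf_mul(coeffs[i], survivors[i])
    return gf_mul(gf_inv(coeffs[lost]), partial)

# Cascade property assumed in this sketch: local parities XOR to the global one.
assert local[0] ^ local[1] == global_parity

# Single data failure: repaired inside the local group.
survivors = list(data)
survivors[1] = None
assert repair_data(1, groups[0], survivors, local[0]) == data[1]

# Lost local parity: rebuilt from the other parities, touching no data blocks.
assert global_parity ^ local[0] == local[1]
```

In this toy, the dependency between parities means a lost local parity is rebuilt from two parity blocks instead of three data blocks. That is one way parity blocks can cooperate during repair; the savings reported in the abstract come from the paper's own repair algorithms, which this sketch does not reproduce.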