Making Wide Stripes Practical: Cascaded Parity LRCs for Efficient Repair and High Reliability

📅 2025-12-11

🤖 AI Summary

Problem: Locally Repairable Codes (LRCs) face structural limitations in wide-stripe erasure coding: enlarged local groups raise single-node repair cost, multi-node failures frequently trigger costly global repair, and reliability degrades sharply. Method: This paper proposes Cascaded Parity LRCs (CP-LRCs), a new family of wide-stripe LRCs that establishes a structured dependency between local and global parity blocks. A global parity block is decomposed across all local parity blocks, forming a cascaded parity group that lets local and global parities cooperate during repair while preserving MDS-level fault tolerance. CP-LRCs are built from a finite-field coefficient-generation framework and cascade-aware repair algorithms, and are instantiated as two variants: CP-Azure and CP-Uniform. Contribution/Results: Evaluations on Alibaba Cloud show repair-time reductions of up to 41% for single-node failures and 26% for two-node failures.

📝 Abstract

Erasure coding with wide stripes is increasingly adopted to reduce storage overhead in large-scale storage systems. However, existing Locally Repairable Codes (LRCs) exhibit structural limitations in this setting: inflated local groups increase single-node repair cost, multi-node failures frequently trigger expensive global repair, and reliability degrades sharply. We identify a key root cause: local and global parity blocks are designed independently, preventing them from cooperating during repair. We present Cascaded Parity LRCs (CP-LRCs), a new family of wide stripe LRCs that embed structured dependency between parity blocks by decomposing a global parity block across all local parity blocks. This creates a cascaded parity group that preserves MDS-level fault tolerance while enabling low-bandwidth single-node and multi-node repairs. We provide a general coefficient-generation framework, develop repair algorithms exploiting cascading, and instantiate the design with CP-Azure and CP-Uniform. Evaluations on Alibaba Cloud show reductions in repair time of up to 41% for single-node failures and 26% for two-node failures.

Problem

Research questions and friction points this paper is trying to address.

Inflated local groups in wide-stripe LRCs raise single-node repair cost

Multi-node failures frequently trigger expensive global repair

Reliability degrades sharply in the wide-stripe setting
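The trade-off behind these friction points can be made concrete with a little arithmetic. The sketch below assumes a conventional Azure-LRC-style layout with k data blocks, l local parity blocks that each cover k/l data blocks, and g global parity blocks. The function name `lrc_costs` and both stripe configurations are illustrative assumptions, not figures from the paper.

```python
from fractions import Fraction

def lrc_costs(k, l, g):
    """Basic costs of a (k, l, g) LRC under an assumed Azure-LRC-style layout.

    k: data blocks, l: local parity blocks, g: global parity blocks.
    Each local parity is assumed to cover exactly k / l data blocks.
    """
    if k % l:
        raise ValueError("k must be divisible by l in this simplified model")
    group_size = k // l
    return {
        # Total blocks stored per data block.
        "storage_overhead": Fraction(k + l + g, k),
        # Repairing one data block reads the other group members
        # plus the local parity: (group_size - 1) + 1 blocks.
        "local_repair_reads": group_size,
        # A repair that falls back to global decoding reads k blocks.
        "global_repair_reads": k,
    }

# Hypothetical narrow and wide stripes, chosen only for illustration.
for name, params in {"narrow": (12, 2, 2), "wide": (96, 4, 4)}.items():
    costs = lrc_costs(*params)
    print(name, params,
          f"overhead={float(costs['storage_overhead']):.3f}",
          f"local_reads={costs['local_repair_reads']}",
          f"global_reads={costs['global_repair_reads']}")
```

In this model, widening the stripe from (12, 2, 2) to (96, 4, 4) lowers storage overhead from 4/3 to 13/12, but a single-node repair grows from 6 to 24 block reads and a global repair from 12 to 96.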

Innovation

Methods, ideas, or system contributions that make the work stand out.

Embeds structured dependency between local and global parity blocks

Decomposes a global parity block across all local parity blocks so the parities can cooperate during repair

Enables low-bandwidth single-node and multi-node repairs with MDS-level fault tolerance
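One way to picture the decomposition idea is a toy code over GF(2^8) in which a global parity G = Σ c_i·d_i is split into per-group partial sums, each stored as a local parity, so that the local parities XOR to G. This is a minimal sketch under that assumption only: the coefficients, the field polynomial 0x11D, and all function names are hypothetical, and it does not reproduce the paper's coefficient-generation framework, its additional global parities, or its fault-tolerance guarantees.

```python
from functools import reduce

def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8), reducing by the polynomial 0x11D."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def gf_inv(a):
    """Multiplicative inverse by brute force (fine for a 256-element field)."""
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)

def scale(block, c):
    """Multiply every byte of a block by the field element c."""
    return bytes(gf_mul(x, c) for x in block)

def xor_blocks(blocks):
    """Byte-wise XOR (field addition) of equal-length blocks."""
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))

def encode(data, group_size, coeffs):
    """Return (local_parities, global_parity) for the toy cascaded layout.

    Local parity j holds the partial sum of c_i * d_i over group j, so the
    XOR of all local parities equals the global parity sum of c_i * d_i.
    Assumes len(data) is a multiple of group_size.
    """
    local = []
    for start in range(0, len(data), group_size):
        idx = range(start, start + group_size)
        local.append(xor_blocks([scale(data[i], coeffs[i]) for i in idx]))
    return local, xor_blocks(local)

def repair_data(i, data, local, group_size, coeffs):
    """Rebuild data[i] from its own local group (group_size block reads)."""
    j = i // group_size
    peers = [m for m in range(j * group_size, (j + 1) * group_size) if m != i]
    partial = xor_blocks([local[j]] + [scale(data[m], coeffs[m]) for m in peers])
    return scale(partial, gf_inv(coeffs[i]))

def repair_local(j, local, global_parity):
    """Rebuild local parity j from the global parity and the other local parities."""
    others = [p for m, p in enumerate(local) if m != j]
    return xor_blocks([global_parity] + others)

# Toy stripe: 6 data blocks of 8 bytes each, two local groups of 3.
k, group_size = 6, 3
data = [bytes((17 * i + j) % 256 for j in range(8)) for i in range(k)]
coeffs = [1, 2, 3, 4, 5, 6]  # arbitrary nonzero values, NOT the paper's coefficients
local, global_parity = encode(data, group_size, coeffs)

assert repair_data(4, data, local, group_size, coeffs) == data[4]
assert repair_local(0, local, global_parity) == local[0]
assert xor_blocks(local) == global_parity
```

In this toy layout a lost parity block is rebuilt from the other parity blocks alone, without reading any data blocks, which is the kind of local and global cooperation the cascaded parity group is meant to enable.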
