Cerberus: Cross-Layer ECC Co-Design for Robust and Efficient Memory Protection

📅 2026-05-04
🤖 AI Summary
This work addresses a limitation of today's three-layer DRAM ECC stack (on-die, link, and system ECC): the layers are designed independently, so they duplicate redundancy, leave coverage gaps, and occasionally interfere with one another, which undermines data reliability in high-density, high-speed DRAM systems. The paper proposes Cerberus, a cross-layer ECC co-design built on an “Encode-Once, Decode-Many” (EODM) principle: the controller encodes once, and the resulting redundancy is reused by link ECC for write-path detection and retry, by on-die ECC for in-device repair on reads, and by system ECC for end-to-end recovery. While preserving the native role of each layer, Cerberus jointly designs complementary parity and syndrome structures, the order in which decoders run, and the allocation of the correction budget. Together these prevent miscorrection amplification and enable selective correction under tight redundancy constraints. The reported evaluations show improved resilience to clustered and peripheral faults with less redundant overhead, underscoring the value of cross-layer coordination in next-generation memory systems such as custom HBMs.
📝 Abstract
As DRAM scales to higher density and I/O speeds, ensuring data correctness becomes increasingly difficult. Industry has responded with a three-layer stack: on-die ECC (O-ECC), link ECC (L-ECC), and system ECC (S-ECC). However, these layers have evolved independently, often duplicating redundancy, leaving coverage gaps, and occasionally interfering with one another. We propose Cerberus, a cross-layer ECC co-design that unifies protection across device, link, and system while preserving the native role of each layer. At its core is an Encode-Once, Decode-Many (EODM) architecture: the controller performs a single encoding whose redundancy is reused by L-ECC for immediate write-path detection and retry, by O-ECC for in-device repair on reads, and by S-ECC for strong end-to-end recovery. Cerberus jointly designs complementary parity and syndrome structures, orders decoders, and allocates the correction budget to prevent miscorrection amplification and enable selective correction under tight redundancy constraints. Our evaluations show improved resilience to clustered and peripheral faults while reducing redundant overhead, underscoring the importance of coordinated cross-layer protection for next-generation memory systems, such as custom HBMs.
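To make the Encode-Once, Decode-Many idea concrete, here is a toy sketch, not the paper's construction. It assumes an extended Hamming(8,4) SEC-DED code as a stand-in for the unspecified Cerberus code, and every function name (`encode`, `syndrome`, `link_check`, `on_die_repair`, `data_bits`) is hypothetical. One encoding is consumed by two decoders with different policies: a link-style check that only detects errors (so the writer can retry) and an on-die-style decoder that repairs single-bit errors and refuses to correct anything larger. A system-level decoder is omitted.

```python
# Toy illustration only: extended Hamming(8,4) SEC-DED stands in for the
# real code, which the abstract does not specify. All names are hypothetical.

def encode(nibble):
    """Encode 4 data bits once. Returns 8 bits: c[0] is overall parity,
    c[1..7] is a Hamming(7,4) codeword with parity at positions 1, 2, 4."""
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = sum(c[1:]) % 2
    return c

def syndrome(c):
    """Return (position syndrome, overall parity) of a received word."""
    s = 0
    for pos in range(1, 8):
        if c[pos]:
            s ^= pos          # XOR of set-bit positions is 0 for a valid word
    return s, sum(c) % 2

def link_check(c):
    """Link-style policy: detect only. False means 'ask the sender to retry'."""
    s, p = syndrome(c)
    return s == 0 and p == 0

def on_die_repair(c):
    """On-die-style policy: fix one flipped bit, flag anything else."""
    s, p = syndrome(c)
    c = list(c)
    if p == 1:                # odd parity: treat as a single-bit error
        c[s] ^= 1             # s == 0 means the overall parity bit itself
        return c, "corrected"
    if s != 0:                # even parity, nonzero syndrome: double error
        return c, "uncorrectable"
    return c, "clean"

def data_bits(c):
    """Extract the 4 data bits from a codeword."""
    return c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)

word = encode(0b1011)         # single encoding, shared by both decoders

single = list(word)
single[5] ^= 1                # one flipped bit
retry_needed = not link_check(single)
repaired, status = on_die_repair(single)

double = list(word)
double[5] ^= 1
double[6] ^= 1                # two flipped bits
_, double_status = on_die_repair(double)
```

In this sketch the double-bit case is reported as uncorrectable rather than "fixed" into a wrong word, which is the simplest form of avoiding a miscorrection that a later decoder would then have to cope with. How Cerberus actually orders decoders and splits the correction budget is not specified in the abstract.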
Problem

Research questions and friction points this paper is trying to address.

DRAM reliability
ECC redundancy
cross-layer protection
memory errors
data integrity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-layer ECC
Encode-Once Decode-Many
Memory reliability
Error correction co-design
Redundancy optimization