🤖 AI Summary
Memory technology scaling exacerbates reliability challenges, while conventional ECC solutions suffer from high cost, high power consumption, and limited spatial scalability. This paper proposes SCREME, a scalable and elastic memory design framework that repurposes low-cost, low-performance redundant chips, which emerge naturally from process scaling, as error-correction storage, eliminating the reliance on high-performance chips for that role. SCREME achieves this through three key mechanisms: DIMM-level heterogeneous performance configuration, bandwidth sharing among multiple low-speed ECC chips, and dynamic interconnects that use underutilized on-die I/O. It integrates with existing memory architectures without modification. Experimental evaluation shows that SCREME reduces error-correction storage cost by up to 42% while preserving system compatibility and practical deployability and improving fault tolerance and scalability.
📝 Abstract
The continuing advancement of memory technology has not only fueled a surge in performance, but also substantially exacerbated reliability challenges. Traditional solutions have primarily focused on improving the efficiency of protection schemes, i.e., Error Correction Codes (ECC), under the assumption that allocating additional memory space for parity data is always expensive and therefore not a scalable solution.
We challenge this assumption by proposing an orthogonal approach that provides additional, cost-effective memory space for resilient memory design. In particular, we recognize that ECC chips (used for parity storage) do not necessarily require the same performance level as regular data chips. This offers two benefits. First, the bandwidth originally provisioned for a regular-performance ECC chip can instead be used to accommodate multiple low-performance chips. Second, the cost of ECC chips can be effectively reduced, as lower performance often correlates with lower expense. In addition, we observe that server-class memory chips are often provisioned with ample, yet underutilized, I/O resources, which can be repurposed to enable flexible on-DIMM interconnections. Building on these two insights, we propose SCREME, a scalable memory framework that leverages cost-effective, albeit slower, chips, which are naturally produced during rapid technology evolution, to meet the growing reliability demands driven by that same evolution.