Breaking the HBM Bit Cost Barrier: Domain-Specific ECC for AI Inference Infrastructure

📅 2025-07-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

High-Bandwidth Memory (HBM) carries a high per-bit cost, driven in part by stringent on-die reliability requirements, which hinders scalable AI inference deployment. This paper proposes a system-level reliability framework that removes on-die ECC and shifts all fault management to the memory controller, enabling a domain-specific ECC architecture. The approach combines large-codeword Reed-Solomon correction with lightweight fine-grained CRC detection, uses differential parity updates to reduce write amplification, and tunes protection strength to data importance, making reliability a configurable system parameter. In an evaluation on LLM inference workloads with raw HBM bit error rates up to 10⁻³, the system retains over 78% of throughput and 97% of model accuracy relative to ideal error-free HBM. The design aims to lower HBM hardware cost while preserving AI inference performance.

📝 Abstract

High-Bandwidth Memory (HBM) delivers exceptional bandwidth and energy efficiency for AI workloads, but its high cost per bit, driven in part by stringent on-die reliability requirements, poses a growing barrier to scalable deployment. This work explores a system-level approach to cost reduction by eliminating on-die ECC and shifting all fault management to the memory controller. We introduce a domain-specific ECC framework combining large-codeword Reed-Solomon (RS) correction with lightweight fine-grained CRC detection, differential parity updates to mitigate write amplification, and tunable protection based on data importance. Our evaluation using LLM inference workloads shows that, even under raw HBM bit error rates up to $10^{-3}$, the system retains over 78% of throughput and 97% of model accuracy compared with systems equipped with ideal error-free HBM. By treating reliability as a tunable system parameter rather than a fixed hardware constraint, our design opens a new path toward low-cost, high-performance HBM deployment in AI infrastructure.
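The abstract does not describe how the differential parity update is implemented. As a rough illustration of the general idea, the sketch below relies on the linearity of Reed-Solomon-style codes: when one data symbol changes, the new parity can be derived from the old parity and the difference between the old and new symbol, without re-reading the rest of a large codeword. The two-parity-symbol code over GF(2^8), the field polynomial, and all function names here are assumptions chosen for brevity, not the paper's design.

```python
# Illustrative sketch only: a two-parity-symbol linear code over GF(2^8)
# (the P/Q construction familiar from RAID-6). The paper's actual codeword
# size, field, and parity layout are not given in this summary.

GF_POLY = 0x11D  # a commonly used primitive polynomial for GF(2^8) (assumption)


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements: carry-less multiply reduced by GF_POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= GF_POLY
        b >>= 1
    return result


def gf_pow(base: int, exp: int) -> int:
    """Raise a GF(2^8) element to a non-negative integer power."""
    result = 1
    for _ in range(exp):
        result = gf_mul(result, base)
    return result


def encode_parity(data: list[int]) -> tuple[int, int]:
    """Full encode: P = XOR of all symbols, Q = XOR of (2^i * symbol_i)."""
    p, q = 0, 0
    for i, symbol in enumerate(data):
        p ^= symbol
        q ^= gf_mul(gf_pow(2, i), symbol)
    return p, q


def differential_update(parity: tuple[int, int], index: int,
                        old: int, new: int) -> tuple[int, int]:
    """Update parity from the changed symbol alone, using code linearity."""
    delta = old ^ new  # difference between old and new symbol in GF(2^8)
    p, q = parity
    return p ^ delta, q ^ gf_mul(gf_pow(2, index), delta)


# Usage: overwrite one symbol and check the shortcut against a full re-encode.
data = [0x12, 0xA7, 0x00, 0xFF, 0x3C, 0x80, 0x5B, 0x01]
parity = encode_parity(data)
index, new_symbol = 5, 0x42
updated = differential_update(parity, index, data[index], new_symbol)
data[index] = new_symbol
assert updated == encode_parity(data)
```

In this sketch a write touches the modified symbol and the parity symbols only, which is the property that makes large codewords less costly to update; how the paper applies it in the memory controller is not specified here.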

Problem

Research questions and friction points this paper is trying to address.

Reducing HBM cost by shifting fault management to memory controller

Introducing domain-specific ECC for AI workloads with high error rates

Maintaining performance and accuracy despite high raw HBM error rates

Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-specific ECC framework for HBM

Large-codeword Reed-Solomon correction with fine-grained CRC detection

Tunable protection based on data importance
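One general reason to pair a detection code with Reed-Solomon correction can be stated with the standard RS decoding bound. The symbols below ($n$, $k$, $e$, $s$) are introduced here for illustration, and whether the paper uses CRC results as erasure hints is an assumption; this summary does not say how the two codes interact.

```latex
% RS(n, k): n total symbols, k data symbols, n - k parity symbols.
% A codeword with e symbol errors at unknown positions and s erasures
% (errors at known positions) is decodable when
\[
  2e + s \le n - k .
\]
% Unknown positions only (s = 0):  e \le \lfloor (n - k) / 2 \rfloor
% Known positions only   (e = 0):  s \le n - k
```

Under that assumption, a corrupted chunk whose location is already flagged consumes one parity symbol instead of two, so the same parity budget covers up to twice as many located symbol errors.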
