🤖 AI Summary
Floating-point compute-in-memory (FP-CiM) accelerators lack robustness against hardware faults, especially bit-flip errors, which cause severe accuracy degradation in large language model (LLM) inference. Method: This work first systematically quantifies the cascading impact of bit-flip faults across critical datapath components, including digital multipliers, compute-in-memory units, and adder trees, revealing their extreme sensitivity in LLM workloads. We propose SafeCiM, a resilient FP-CiM architecture featuring a novel pre-alignment optimization and a fault-tolerant adder-tree design that suppress catastrophic accuracy collapse from single-point failures. Results: On a 4096-MAC-scale accelerator, SafeCiM reduces worst-case accuracy degradation under single-fault conditions by up to 49×. Whereas a single adder fault in a conventional FP-CiM can drive LLM accuracy to zero, SafeCiM significantly enhances system reliability, delivering a practical fault-tolerant solution for mission-critical FP-CiM deployments.
📝 Abstract
Deep Neural Networks (DNNs) continue to grow in complexity, with Large Language Models (LLMs) incorporating vast numbers of parameters. Traditional accelerators handle these parameters inefficiently due to data-transmission bottlenecks, motivating Compute-in-Memory (CiM) architectures that integrate computation within or near memory to reduce data movement. Recent work has explored CiM designs using both Floating-Point (FP) and Integer (INT) operations. FP computation typically delivers higher output quality thanks to its wider dynamic range and precision, benefiting precision-sensitive Generative AI applications such as LLMs and driving advances in FP-CiM accelerators. However, the vulnerability of FP-CiM to hardware faults remains underexplored, posing a major reliability concern in mission-critical settings. To address this gap, we systematically analyze hardware fault effects in FP-CiM by injecting bit-flip faults at key computational stages, including the digital multipliers, CiM memory cells, and digital adder trees. Experiments with Convolutional Neural Networks (CNNs) such as AlexNet and state-of-the-art LLMs including LLaMA-3.2-1B and Qwen-0.3B-Base reveal how faults at each stage affect inference accuracy; notably, a single adder fault can reduce LLM accuracy to 0%. Based on these insights, we propose SafeCiM, a fault-resilient design featuring a pre-alignment stage, which mitigates fault impact far better than a naive FP-CiM. For example, with 4096 MAC units, SafeCiM reduces accuracy degradation by up to 49× for a single adder fault compared to the baseline FP-CiM architecture.
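To see why a single bit flip in an FP datapath can be so destructive, consider what one flipped bit does to an IEEE-754 single-precision value. The sketch below is an illustrative stand-alone example, not the paper's fault-injection framework: the helper `flip_bit` is a hypothetical name introduced here for demonstration.

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of the IEEE-754 float32 encoding of x.

    Bit layout (LSB first): bits 0-22 mantissa, bits 23-30 exponent,
    bit 31 sign. Returns the value decoded from the corrupted pattern.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))   # float32 -> uint32
    (y,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return y

# A mantissa-LSB flip barely perturbs the value: 1.0 -> 1.0 + 2^-23.
print(flip_bit(1.0, 0))   # ~1.0000001

# An exponent-LSB flip halves the value: exponent 127 -> 126.
print(flip_bit(1.0, 23))  # 0.5

# An exponent-MSB flip saturates the exponent field: 1.0 -> +inf.
# Once an inf (or NaN) enters an adder tree, every downstream partial
# sum is corrupted, which is why a single stuck adder bit can collapse
# LLM accuracy to 0%.
print(flip_bit(1.0, 30))  # inf
```

This asymmetry between mantissa and exponent bits is exactly what makes the alignment and accumulation stages the most fault-sensitive parts of an FP-CiM datapath.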