SHIELD: A Segmented Hierarchical Memory Architecture for Energy-Efficient LLM Inference on Edge NPUs

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work targets large language model (LLM) inference on edge NPUs, where limited on-chip memory capacity and the high energy cost of periodic eDRAM refresh, especially for transient, error-resilient activations such as Query and Attention Output (QO), limit efficiency. The authors propose SHIELD, a lifetime-aware, partitioned eDRAM architecture that exploits the temporal residency and bit-level sensitivity of bfloat16 activations by storing sign, exponent, and mantissa bits separately. Refresh is then differentiated by field and lifetime: it is disabled entirely for the mantissa bits of transient QO activations, and relaxed for the mantissa bits of the persistent KV cache. Evaluated across diverse LLMs and inference scenarios, SHIELD reduces eDRAM refresh energy by 35% compared to a standard-refresh baseline while preserving model accuracy on benchmarks including WikiText-2, PIQA, and ARC-Easy.
📝 Abstract
Large Language Model (LLM) inference on edge Neural Processing Units (NPUs) is fundamentally constrained by limited on-chip memory capacity. Although high-density embedded DRAM (eDRAM) is attractive for storing activation workspaces, its periodic refresh consumes substantial energy. Prior work has primarily focused on reducing off-chip traffic or optimizing refresh for persistent Key-Value (KV) caches, while transient and error-resilient Query and Attention Output (QO) activations are largely overlooked. We propose SHIELD, a lifecycle-aware segmented eDRAM architecture that jointly exploits temporal residency and bit-level sensitivity in bfloat16 (BF16) activations. SHIELD isolates the sign and exponent fields from the mantissa, disables refresh for transient QO mantissas, and applies relaxed refresh to persistent KV mantissas. Across multiple LLMs and inference scenarios, SHIELD reduces eDRAM refresh energy by 35% relative to a standard-refresh baseline while preserving accuracy on WikiText-2, PIQA, and ARC-Easy.
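The field split that SHIELD exploits follows directly from the BF16 format (1 sign bit, 8 exponent bits, 7 mantissa bits, i.e. the top 16 bits of an IEEE-754 float32). A minimal Python sketch of that decomposition, and of the effect of letting mantissa cells decay, is below; the helper names are illustrative, and the assumption that refresh-disabled cells read back as zeros is a modeling choice, not a detail taken from the paper:

```python
import struct

def bf16_fields(x: float) -> tuple[int, int, int]:
    """Split a value's bfloat16 representation into (sign, exponent, mantissa).

    BF16 is the top 16 bits of a float32: 1 sign bit, 8 exponent bits,
    7 mantissa bits. SHIELD stores these fields in separate eDRAM segments.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0] >> 16  # truncate to BF16
    sign = bits >> 15
    exponent = (bits >> 7) & 0xFF
    mantissa = bits & 0x7F
    return sign, exponent, mantissa

def decay_mantissa(x: float, keep_bits: int = 0) -> float:
    """Model refresh-disabled mantissa storage by zeroing low mantissa bits.

    keep_bits=0 drops the whole 7-bit mantissa (the transient-QO case);
    sign and exponent are always preserved, so magnitude order survives.
    Assumes decayed cells read as 0, which is an illustrative simplification.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0] >> 16
    mask = ~((1 << (7 - keep_bits)) - 1) & 0xFFFF  # clear the low mantissa bits
    return struct.unpack(">f", struct.pack(">I", (bits & mask) << 16))[0]
```

For example, `bf16_fields(1.5)` yields `(0, 127, 64)`, and `decay_mantissa(1.5)` collapses the value to `1.0`: the worst-case error from losing all mantissa bits is bounded because the sign and exponent fields, which SHIELD keeps fully refreshed, still pin the value to the correct binade.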
Problem

Research questions and friction points this paper is trying to address.

LLM inference
edge NPU
eDRAM refresh energy
activation memory
energy efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

segmented memory architecture
refresh-aware eDRAM
bit-level sensitivity
transient activation management
energy-efficient LLM inference