UMDAM: A Unified Data Layout and DRAM Address Mapping for Heterogeneous NPU-PIM

📅 2025-11-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address memory bandwidth and capacity bottlenecks during LLM decoding on edge NPU devices, this paper proposes the first NPU–PIM co-computing-oriented column-major tiling data layout and configurable DRAM address mapping mechanism. The approach resolves three key challenges—data layout mismatch, bandwidth underutilization, and redundant storage—simultaneously and without additional storage overhead, via tile-based columnar data placement, memory-affinity optimization, and dynamic address remapping. Evaluated on OPT-family models, it achieves up to a 3.0× reduction in time-to-first-token and a 2.18× reduction in time-to-last-token, significantly improving end-to-end inference throughput. This work establishes a scalable, system-level co-design paradigm for efficient PIM acceleration of edge-deployed LLMs.

📝 Abstract
Large Language Models (LLMs) are increasingly deployed on edge devices with Neural Processing Units (NPUs), yet the decode phase remains memory-intensive, limiting performance. Processing-in-Memory (PIM) offers a promising solution, but co-executing NPU-PIM systems face challenges such as data layout mismatches, bandwidth loss, and redundant storage. To address these issues, we propose UMDAM, a unified memory-affinity data layout and DRAM address mapping scheme tailored for NPU-PIM co-execution. UMDAM employs a column-major, tile-based layout and a configurable DRAM mapping strategy to ensure compatibility with NPU computation while maximizing PIM efficiency -- without introducing extra memory overhead or bandwidth loss. Comprehensive evaluations on OPT models demonstrate that UMDAM reduces time-to-first-token (TTFT) by up to 3.0x and time-to-last-token (TTLT) by 2.18x, significantly improving end-to-end LLM inference efficiency on edge devices.
Problem

Research questions and friction points this paper is trying to address.

Addresses memory-intensive LLM decode phase on edge NPUs
Solves data layout mismatches in NPU-PIM co-execution systems
Eliminates bandwidth loss and redundant storage in heterogeneous architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified memory-affinity data layout for NPU-PIM co-execution
Column-major tile-based layout with configurable DRAM mapping
Eliminates data layout mismatches without extra memory overhead
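To make the column-major, tile-based idea concrete, here is a minimal illustrative sketch (not the paper's actual scheme): a weight matrix is partitioned into fixed-size tiles, tiles are enumerated in column-major order, and each tile is assigned a hypothetical (bank, row) DRAM address by interleaving consecutive tiles across banks. Tile size, bank count, and the mapping function are all assumptions made for the example.

```python
# Illustrative sketch only: column-major tile enumeration with a
# hypothetical bank-interleaved DRAM mapping. The real UMDAM mapping
# is configurable and more involved; this just shows the layout idea.

def column_major_tile_order(rows, cols, tile, n_banks):
    """Yield (tile_r, tile_c, bank, row) for each tile of a rows x cols
    matrix, walking tiles column-by-column so that consecutive tiles in
    the traversal land in consecutive DRAM banks."""
    tiles_r = rows // tile
    tiles_c = cols // tile
    for tc in range(tiles_c):            # outer loop over tile columns -> column-major
        for tr in range(tiles_r):
            linear = tc * tiles_r + tr   # linear tile index in column-major order
            bank = linear % n_banks      # interleave tiles across banks
            row = linear // n_banks      # advance the row once all banks are used
            yield (tr, tc, bank, row)

# Example: an 8x8 matrix split into 2x2 tiles, spread over 4 banks.
layout = list(column_major_tile_order(rows=8, cols=8, tile=2, n_banks=4))
```

Under this toy mapping, the four tiles of the first tile column fill banks 0–3 of row 0, so a PIM unit streaming one matrix column can read all banks in parallel; that bank-level parallelism is the intuition behind column-major tiling for PIM.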
Hai Huang
School of Integrated Circuit Science and Engineering, Beihang University, Beijing, China
Xuhong Qiang
School of Integrated Circuit Science and Engineering, Beihang University, Beijing, China
Weisheng Zhao
Fert Beijing Institute, Beihang University
Spintronics Devices and Integrated Circuits
Chenchen Liu
University of Maryland, Baltimore County
High-Performance Computing, Deep Learning, Brain-Inspired Computing, Emerging Memory Technologies