LP-Spec: Leveraging LPDDR PIM for Efficient LLM Mobile Speculative Inference with Architecture-Dataflow Co-Optimization

📅 2025-08-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address memory bandwidth and compute limitations in mobile large language model (LLM) inference, this work proposes a dataflow-coordinated processing-in-memory (PIM) architecture built upon LPDDR5, specifically optimized for speculative decoding. Existing GEMV-based PIM accelerators suffer from redundant GEMM computations and inefficient draft token management in tree-based speculation. To tackle these issues, the authors introduce three key innovations: (1) dynamic workload scheduling, (2) a hardware-aware draft token pruning unit, and (3) a fine-grained data redistribution mechanism between DRAM and PIM. These jointly eliminate redundant computation and enable near-data parallel execution. Experimental results demonstrate that the design achieves 13.21× higher throughput, 7.56× better energy efficiency, and a 99.87× reduction in energy-delay product (EDP) compared to both mobile NPUs and state-of-the-art GEMV-accelerated PIM baselines.
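To make the draft-token-pruning idea concrete: tree-based speculative decoding expands several candidate continuations per step, and many low-probability branches would be verified (as GEMM work) for nothing. Below is a minimal software-level sketch of score-based pruning of a draft tree; the names (`DraftNode`, `prune_tree`, `keep_budget`) are illustrative assumptions, not LP-Spec's hardware pruning unit, which additionally accounts for PIM bank mapping.

```python
from dataclasses import dataclass, field

@dataclass
class DraftNode:
    token: int
    prob: float                         # draft model's probability for this token
    children: list = field(default_factory=list)

def path_scores(node, score=1.0, path=()):
    """Enumerate root-to-node draft paths with cumulative probability."""
    path = path + (node.token,)
    score *= node.prob
    yield path, score
    for child in node.children:
        yield from path_scores(child, score, path)

def prune_tree(root, keep_budget):
    """Keep only the `keep_budget` highest-probability draft paths,
    bounding the GEMM work the target model must spend on verification."""
    ranked = sorted(path_scores(root), key=lambda ps: -ps[1])
    kept_paths = [p for p, _ in ranked[:keep_budget]]
    # Prefix-close the set so every kept path's ancestors also survive.
    survivors = set()
    for p in kept_paths:
        for i in range(1, len(p) + 1):
            survivors.add(p[:i])
    return survivors
```

A hardware-aware pruner would replace the pure probability ranking with a cost model that also weighs where each token's activations live (DRAM vs. PIM banks), but the budget-and-prefix-close structure is the same.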

📝 Abstract
LLM inference on mobile devices faces severe challenges due to limited memory bandwidth and computational resources. To address these issues, speculative inference and processing-in-memory (PIM) techniques have been explored at the algorithmic and hardware levels. However, speculative inference results in more compute-intensive GEMM operations, creating new design trade-offs for existing GEMV-accelerated PIM architectures. Furthermore, there exists a significant amount of redundant draft tokens in tree-based speculative inference, necessitating efficient token management schemes to minimize energy consumption. In this work, we present LP-Spec, an architecture-dataflow co-design leveraging a hybrid LPDDR5 performance-enhanced PIM architecture with draft token pruning and dynamic workload scheduling to accelerate LLM speculative inference. A near-data memory controller is proposed to enable data reallocation between DRAM and PIM banks. Furthermore, a data allocation unit based on the hardware-aware draft token pruner is developed to minimize energy consumption and fully exploit parallel execution opportunities. Compared to end-to-end LLM inference on other mobile solutions such as mobile NPUs or GEMV-accelerated PIMs, LP-Spec achieves 13.21x, 7.56x, and 99.87x improvements in performance, energy efficiency, and energy-delay product (EDP). Compared with the prior AttAcc PIM and an RTX 3090 GPU, LP-Spec obtains 12.83x and 415.31x EDP reductions, respectively.
Problem

Research questions and friction points this paper is trying to address.

Address limited memory bandwidth in mobile LLM inference
Optimize PIM architecture for compute-intensive GEMM operations
Reduce redundant draft tokens in speculative inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid LPDDR5 PIM architecture for LLM inference
Draft token pruning and dynamic workload scheduling
Near-data memory controller for energy efficiency
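The dynamic workload scheduling idea can be sketched in a few lines: single-token GEMV during normal decoding is memory-bound and suits PIM banks, while the batched GEMM produced by verifying a draft tree is compute-bound and suits the NPU. The sketch below routes operations by arithmetic intensity; the threshold and function names are assumptions for illustration, not LP-Spec's actual scheduling policy.

```python
def arithmetic_intensity(m, n, k):
    """FLOPs per byte moved for an (m x k) @ (k x n) matmul with fp16 operands."""
    flops = 2 * m * n * k
    bytes_moved = 2 * (m * k + k * n + m * n)   # fp16 = 2 bytes per element
    return flops / bytes_moved

def schedule(m, n, k, pim_threshold=8.0):
    """Route memory-bound ops (low intensity, e.g. GEMV with m == 1) to PIM;
    compute-bound GEMMs from batched draft verification go to the NPU."""
    return "PIM" if arithmetic_intensity(m, n, k) < pim_threshold else "NPU"
```

For a 4096x4096 weight matrix, a single decode step (m = 1) has intensity near 1 FLOP/byte and lands on PIM, whereas verifying 64 draft tokens at once (m = 64) exceeds the threshold and lands on the NPU.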
Siyuan He
School of Integrated Circuits, Peking University
Zhantong Zhu
School of Integrated Circuits, Peking University
Yandong He
School of Integrated Circuits, Peking University
Tianyu Jia
Assistant Professor, Peking University
VLSI Design, Computer Architecture