🤖 AI Summary
To address memory bandwidth and compute limitations in mobile large language model (LLM) inference, this work proposes a dataflow-coordinated processing-in-memory (PIM) architecture built on LPDDR5 and optimized for speculative decoding. Existing GEMV-based PIM accelerators suffer from redundant GEMM computation and inefficient draft token management in tree-based speculation. To tackle these issues, we introduce three key innovations: (1) dynamic workload scheduling, (2) a hardware-aware draft token pruning unit, and (3) a fine-grained data redistribution mechanism between DRAM and PIM banks. Together, these eliminate redundant computation and enable near-data parallel execution. Experimental results show that the design achieves 13.21× higher throughput, 7.56× better energy efficiency, and a 99.87× reduction in energy-delay product (EDP) compared with both mobile NPUs and state-of-the-art GEMV-accelerated PIM baselines.
📝 Abstract
LLM inference on mobile devices faces severe challenges due to limited memory bandwidth and computational resources. To address these issues, speculative inference and processing-in-memory (PIM) techniques have been explored at the algorithmic and hardware levels. However, speculative inference introduces more compute-intensive GEMM operations, creating new design trade-offs for existing GEMV-accelerated PIM architectures. Furthermore, tree-based speculative inference produces a significant number of redundant draft tokens, necessitating efficient token management schemes to minimize energy consumption. In this work, we present LP-Spec, an architecture-dataflow co-design that leverages a hybrid, performance-enhanced LPDDR5 PIM architecture together with draft token pruning and dynamic workload scheduling to accelerate LLM speculative inference. A near-data memory controller is proposed to enable data reallocation between DRAM and PIM banks. Furthermore, a data allocation unit built on the hardware-aware draft token pruner minimizes energy consumption and fully exploits parallel execution opportunities. Compared to end-to-end LLM inference on other mobile solutions such as mobile NPUs or GEMV-accelerated PIMs, LP-Spec achieves 13.21x, 7.56x, and 99.87x improvements in performance, energy efficiency, and energy-delay product (EDP). Compared with the prior AttAcc PIM and an RTX 3090 GPU, LP-Spec obtains 12.83x and 415.31x EDP reductions.
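The abstract does not specify the pruner's exact criterion, only that redundant draft tokens in the speculation tree are removed before execution. As a rough illustration of the general idea, the sketch below prunes branches of a draft token tree whose cumulative draft-model probability falls below a threshold; every name here (`DraftNode`, `prune_tree`, the example tree, the 0.05 cutoff) is a hypothetical stand-in, not the paper's actual hardware-aware design:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DraftNode:
    """One speculated token in the draft tree."""
    token: int
    prob: float                      # draft model's probability for this token
    children: List["DraftNode"] = field(default_factory=list)

def count_nodes(node: DraftNode) -> int:
    """Total tokens in the tree (proxy for GEMM work sent to the PIM banks)."""
    return 1 + sum(count_nodes(c) for c in node.children)

def prune_tree(node: DraftNode, path_prob: float = 1.0,
               threshold: float = 0.05) -> DraftNode:
    """Drop branches whose cumulative draft probability falls below the
    threshold; pruned tokens are never verified, saving compute and energy."""
    kept = []
    for child in node.children:
        p = path_prob * child.prob
        if p >= threshold:
            prune_tree(child, p, threshold)
            kept.append(child)
    node.children = kept
    return node

# Example: a small draft tree rooted at the last accepted token.
root = DraftNode(0, 1.0, [
    DraftNode(5, 0.6, [DraftNode(7, 0.5), DraftNode(8, 0.05)]),
    DraftNode(9, 0.3, [DraftNode(4, 0.1)]),
    DraftNode(2, 0.05),
])
before = count_nodes(root)  # 7 tokens speculated
prune_tree(root)
after = count_nodes(root)   # 5 tokens survive; two low-probability leaves dropped
```

A real hardware-aware pruner would fold in device constraints (e.g., PIM bank parallelism and batch shape) rather than a fixed probability cutoff, but the effect is the same: fewer draft tokens means smaller verification GEMMs.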