AI Summary
This work addresses the challenge of running large language models efficiently on smartphones. DRAM capacity on these devices is limited, and flash-based inference suffers from high I/O latency because autoregressive decoding accesses the model at every step. To overcome this, the paper introduces the first adaptation of speculative decoding to mobile devices, featuring an I/O- and compute-aware token tree construction, early-exit-guided branch pruning, and a CPU-NPU cooperative execution strategy. A small draft model resides in DRAM while the full target model remains on flash, so the target model can verify a batch of candidate tokens in a single invocation. Compared to baseline flash-offloading inference and conventional speculative decoding, the proposed approach reduces average inference latency by 2.93× and 1.50×, respectively, substantially narrowing the performance gap between flash-resident and memory-resident models.
Abstract
Large language models (LLMs) are increasingly needed for interactive mobile applications, but high-quality models exceed the limited DRAM available on smartphones. Flash storage can hold larger models, yet flash-backed inference is slow because autoregressive decoding repeatedly invokes the target model and incurs costly I/O. We observe that speculative decoding is a natural fit for this setting: a small draft model can remain in DRAM, while a larger flash-resident target model verifies multiple candidate tokens per invocation. However, existing methods assume server-class accelerators and fail to account for prolonged I/O latency, limited computation parallelism, and irregular speculation execution.
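The draft-then-verify loop described above can be illustrated with a minimal sketch of greedy speculative decoding. This is not Lever's implementation: the function names, the toy `draft` and `target` models, and the chain-shaped (rather than tree-shaped) speculation are all assumptions made for illustration. The point is only that one target invocation can confirm several tokens, so the flash-resident weights are read fewer times than in plain autoregressive decoding.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to its greedy next token


def speculative_decode(draft: NextToken, target: NextToken,
                       prompt: List[Token], num_new: int, k: int
                       ) -> Tuple[List[Token], int]:
    """Greedy speculative decoding with a draft chain of length k.

    Returns the generated sequence and the number of target invocations.
    """
    out = list(prompt)
    goal = len(prompt) + num_new
    invocations = 0
    while len(out) < goal:
        # 1) Draft k candidates autoregressively with the small model.
        cand: List[Token] = []
        for _ in range(k):
            cand.append(draft(out + cand))
        # 2) One target invocation scores all k + 1 positions. A real system
        #    would do this in a single batched forward pass; the per-position
        #    calls below only emulate that pass.
        invocations += 1
        preds = [target(out + cand[:i]) for i in range(k + 1)]
        # 3) Accept the longest prefix on which draft and target agree, then
        #    append the target's own token at the first disagreement (or the
        #    bonus token if every candidate was accepted).
        n = 0
        while n < k and cand[n] == preds[n]:
            n += 1
        out.extend(cand[:n] + [preds[n]])
    return out[:goal], invocations


def autoregressive_decode(target: NextToken, prompt: List[Token],
                          num_new: int) -> Tuple[List[Token], int]:
    """Baseline: one target invocation per generated token."""
    out = list(prompt)
    for _ in range(num_new):
        out.append(target(out))
    return out, num_new


# Hypothetical toy models: the target counts modulo 10, and the draft agrees
# except after token 4, where it guesses wrong.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 7 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

With greedy acceptance, the output matches what the target model alone would produce; only the number of target invocations changes.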
We present Lever, an end-to-end system for efficient flash-backed LLM inference on smartphones. Lever jointly optimizes the three stages of speculative decoding under mobile constraints. For drafting, it builds token trees using an I/O- and compute-aware gain-cost objective. For verification, it prunes low-value branches through early-exit prediction to reduce target-model computation. For execution, it maps speculation efficiently across mobile CPU-NPU hardware to improve utilization. Comprehensive evaluations show that Lever reduces inference latency by an average of 2.93x over baseline flash-offloaded inference and 1.50x over conventional speculative decoding, narrowing the latency gap between flash-backed and memory-resident LLM inference.
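The abstract does not specify Lever's gain-cost objective, so the following is only a toy model of the trade-off it names. It assumes a draft chain instead of a token tree, an independent per-token acceptance probability `p`, a fixed flash I/O cost per target invocation (`io_ms`), and per-token draft and verification costs (`draft_ms`, `verify_ms`). All of these names and the cost model itself are hypothetical.

```python
def expected_tokens(p: float, n: int) -> float:
    """Expected tokens emitted per target invocation for a draft chain of
    length n, assuming each draft token is accepted independently with
    probability p (0 <= p < 1). Equals 1 + p + p^2 + ... + p^n."""
    return (1.0 - p ** (n + 1)) / (1.0 - p)


def step_latency_ms(n: int, io_ms: float, draft_ms: float,
                    verify_ms: float) -> float:
    """Assumed cost of one speculation step: a fixed I/O cost, n draft
    tokens, and target computation over n + 1 positions."""
    return io_ms + n * draft_ms + (n + 1) * verify_ms


def best_draft_length(p: float, io_ms: float, draft_ms: float,
                      verify_ms: float, max_n: int = 16) -> int:
    """Pick the draft length that maximizes expected tokens per millisecond.
    n = 0 corresponds to plain autoregressive decoding."""
    def tokens_per_ms(n: int) -> float:
        return expected_tokens(p, n) / step_latency_ms(n, io_ms, draft_ms, verify_ms)
    return max(range(max_n + 1), key=tokens_per_ms)
```

Under this model, a larger fixed I/O cost favors longer speculation, since each extra candidate is cheap relative to re-reading the weights. That is the intuition behind making the drafting objective I/O-aware, although the actual system may weigh these terms differently.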