Lever: Speculative LLM Inference on Smartphones

📅 2026-05-15
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the challenge of efficiently running large language models on smartphones, which are constrained by limited DRAM capacity and suffer from high I/O latency when relying on flash-based inference due to frequent model accesses during autoregressive decoding. To overcome this, the paper introduces the first adaptation of speculative decoding to mobile devices, featuring an I/O- and compute-aware token tree construction, early-exit-guided branch pruning, and a CPU-NPU cooperative execution strategy. A small draft model resides in DRAM while the full target model remains on flash, enabling batch verification of candidate tokens. Compared to baseline flash-offloading inference and conventional speculative decoding, the proposed approach reduces average inference latency by 2.93× and 1.50×, respectively, substantially narrowing the performance gap between flash-resident and memory-resident models.
📝 Abstract
Large language models (LLMs) are increasingly needed for interactive mobile applications, but high-quality models exceed the limited DRAM available on smartphones. Flash storage can hold larger models, yet flash-backed inference is slow because autoregressive decoding repeatedly invokes the target model and incurs costly I/O. We observe that speculative decoding is a natural fit for this setting: a small draft model can remain in DRAM, while a larger flash-resident target model verifies multiple candidate tokens per invocation. However, existing methods assume server-class accelerators and fail to account for prolonged I/O latency, limited computation parallelism, and irregular speculation execution. We present Lever, an end-to-end system for efficient flash-backed LLM inference on smartphones. Lever jointly optimizes the three stages of speculative decoding under mobile constraints. For drafting, it builds token trees using an I/O- and compute-aware gain-cost objective. For verification, it prunes low-value branches through early-exit prediction to reduce target-model computation. For execution, it maps speculation efficiently across mobile CPU-NPU hardware to improve utilization. Comprehensive evaluations show that Lever reduces inference latency by an average of 2.93x over baseline flash-offloaded inference and 1.50x over conventional speculative decoding, narrowing the latency gap between flash-backed and memory-resident LLM inference.
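The draft-then-verify loop that the abstract builds on can be illustrated with a minimal sketch. This is generic greedy speculative decoding over a linear draft sequence, not Lever's implementation: it has no token tree, no early-exit pruning, no flash I/O, and no CPU-NPU mapping. The names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical, and the toy models are simple arithmetic functions standing in for real networks. One verification round is counted as a single target invocation, on the assumption that a real system would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token context to the next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], num_new: int) -> List[int]:
    """Baseline: one target invocation per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    draft: NextToken, target: NextToken, prompt: List[int], num_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy draft-then-verify decoding.

    Returns the generated tokens and the number of verification rounds,
    each of which stands in for one (batched) target invocation.
    """
    tokens = list(prompt)
    goal = len(prompt) + num_new
    target_calls = 0
    while len(tokens) < goal:
        # 1) Draft k candidate tokens with the small, cheap model.
        ctx = list(tokens)
        candidates = []
        for _ in range(k):
            tok = draft(ctx)
            candidates.append(tok)
            ctx.append(tok)

        # 2) Verify all candidates in one round. The per-prefix calls below
        #    simulate what a batched forward pass would return at once.
        target_calls += 1
        ctx = list(tokens)
        accepted = []
        for cand in candidates:
            expected = target(ctx)
            if expected != cand:
                accepted.append(expected)  # replace the first wrong guess
                break
            accepted.append(cand)
            ctx.append(cand)
        else:
            accepted.append(target(ctx))  # all accepted: one bonus token

        tokens.extend(accepted)
    return tokens[:goal], target_calls


def toy_target(ctx: List[int]) -> int:
    # Hypothetical stand-in for the large, flash-resident model.
    return (ctx[-1] * 3 + 1) % 17


def toy_draft(ctx: List[int]) -> int:
    # Hypothetical stand-in for the small DRAM-resident model:
    # agrees with the target except when the last token is a multiple of 5.
    return 0 if ctx[-1] % 5 == 0 else (ctx[-1] * 3 + 1) % 17
```

Because every emitted token is either a draft token the target agrees with or the target's own choice, the output matches plain greedy decoding with the target model, while the number of verification rounds can be smaller than the number of generated tokens. In a flash-backed setting, fewer target invocations would mean fewer passes over the flash-resident weights, which is the effect the abstract describes.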
Problem

Research questions and friction points this paper is trying to address.

LLM inference
smartphones
flash storage
I/O latency
speculative decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
flash-backed inference
mobile LLM
token tree pruning
CPU-NPU co-execution