EdgeFlow: Fast Cold Starts for LLMs on Mobile Devices

📅 2026-04-10

🤖 AI Summary
This work targets cold-start latency in large language model (LLM) inference on mobile devices, where the key bottleneck is flash bandwidth wasted on loading unimportant model parameters. The authors propose EdgeFlow, a framework that combines three techniques: an NPU-aware adaptive quantization algorithm that assigns precision to weights at fine granularity according to their importance, a SIMD-friendly packing format for converting mixed-precision weights into fixed-size NPU-native data types, and a fine-grained pipeline that coordinates CPU and NPU computation. Experiments show up to a 4.07× reduction in cold-start latency compared with three state-of-the-art mobile inference frameworks (llama.cpp, MNN, and llm.npu) at comparable model accuracy.
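
To make the flash-bandwidth argument concrete, the following is a minimal back-of-the-envelope sketch, not a model from the paper. It assumes load time is simply bytes read divided by sustained flash read bandwidth; the function name `flash_load_seconds`, the 3-billion-parameter model size, and the 2 GB/s bandwidth are all hypothetical values chosen for illustration.

```python
def flash_load_seconds(num_params: int, avg_bits_per_weight: float,
                       flash_gb_per_s: float) -> float:
    """Estimate the time to stream all weights from flash.

    Assumption: load time = total bytes / sustained read bandwidth,
    ignoring file-system overhead and any overlap with compute.
    """
    total_bytes = num_params * avg_bits_per_weight / 8
    return total_bytes / (flash_gb_per_s * 1e9)


# Hypothetical numbers for illustration only (not taken from the paper).
NUM_PARAMS = 3_000_000_000   # assumed 3B-parameter model
FLASH_GB_PER_S = 2.0         # assumed sustained flash read bandwidth

uniform_8bit = flash_load_seconds(NUM_PARAMS, 8.0, FLASH_GB_PER_S)
mixed_avg_4bit = flash_load_seconds(NUM_PARAMS, 4.0, FLASH_GB_PER_S)

print(f"8-bit everywhere: {uniform_8bit:.2f} s")
print(f"4-bit average:    {mixed_avg_4bit:.2f} s")
```

Under this simplified model, halving the average bits per weight halves the time spent reading from flash, which is why lowering the precision of less important parameters can shorten a cold start.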

📝 Abstract
Deploying large language models (LLMs) on mobile devices is an emerging trend that enables data privacy and offline accessibility for LLM applications. Modern mobile neural processing units (NPUs) make such deployment increasingly feasible. However, existing mobile LLM inference frameworks suffer from high start-up latency due to their inevitable cold starts, i.e., launching LLM inference when the model is not resident in device memory. In this paper, we identify the key bottleneck of mobile LLM cold starts as the waste of flash bandwidth on unimportant model parameters. We design EdgeFlow, a mobile LLM inference framework that mitigates the cold-start issue by adaptively adjusting the precision of LLM parameters. Specifically, EdgeFlow leverages 1) an NPU-aware adaptive quantization algorithm that assigns different precisions to weights at a finer granularity according to their importance and NPU constraints, 2) a SIMD-friendly packing format that accelerates the transformation of mixed-precision weights into fixed-size NPU-native data types, and 3) a synergistic granular pipeline that coordinates CPU and NPU computation in a fine-grained and dynamic manner. Experimental results show that EdgeFlow reduces cold-start latency by up to 4.07× compared with three state-of-the-art mobile LLM inference frameworks, i.e., llama.cpp, MNN, and llm.npu, under comparable model accuracy.
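
The first two ideas in the abstract (importance-based precision assignment and unpacking low-bit weights into a fixed-size type) can be illustrated with a small sketch. This is not EdgeFlow's algorithm or packing format, which the abstract does not specify. It assumes a simple top-k rule that gives the most important weight groups 8 bits and the rest 4 bits, and a plain two-nibbles-per-byte layout expanded to int8 with vectorized NumPy operations standing in for SIMD. The names `assign_bits`, `pack_int4`, and `unpack_int4_to_int8` are hypothetical.

```python
import numpy as np


def assign_bits(importance: np.ndarray, high_fraction: float = 0.25) -> np.ndarray:
    """Assumed rule: the top `high_fraction` of groups get 8 bits, the rest 4."""
    n_high = int(round(len(importance) * high_fraction))
    bits = np.full(len(importance), 4, dtype=np.int64)
    if n_high > 0:
        top = np.argsort(importance)[-n_high:]  # indices of the largest scores
        bits[top] = 8
    return bits


def pack_int4(values: np.ndarray) -> np.ndarray:
    """Pack signed 4-bit values in [-8, 7] two per byte, low nibble first."""
    assert values.size % 2 == 0, "need an even number of values"
    nibbles = (values.astype(np.int8) & 0x0F).astype(np.uint8)
    return (nibbles[0::2] | (nibbles[1::2] << 4)).astype(np.uint8)


def unpack_int4_to_int8(packed: np.ndarray) -> np.ndarray:
    """Expand packed 4-bit values to int8 using whole-array operations."""
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2] = (packed & 0x0F).astype(np.int8)  # low nibbles
    out[1::2] = (packed >> 4).astype(np.int8)    # high nibbles
    # Sign-extend: raw nibble values 8..15 encode -8..-1.
    return np.where(out > 7, out - 16, out).astype(np.int8)


# Usage with made-up importance scores and weights.
scores = np.array([0.1, 0.9, 0.3, 0.2])
bits_per_group = assign_bits(scores, high_fraction=0.25)

weights = np.array([-8, -1, 0, 7, 3, -4], dtype=np.int8)
packed = pack_int4(weights)
restored = unpack_int4_to_int8(packed)
```

The packed array holds half as many bytes as the int8 original, so less data is read from flash, while the unpack step restores a fixed-size int8 array of the kind an accelerator could consume directly.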

Problem

Research questions and friction points this paper is trying to address.

cold start
large language models
mobile devices
inference latency
flash bandwidth

Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive quantization
cold start optimization
mobile LLM inference
NPU-aware computation
SIMD-friendly packing