🤖 AI Summary
This work addresses two obstacles to on-device large language model (LLM) inference, limited hardware resources and the high computational overhead of the prefill phase, by proposing an adaptive key-value (KV) cache loading framework. The framework combines streaming KV cache transmission from the cloud with local on-device computation. It uses cost-model-based scheduling of KV chunks, overlaps the streaming and computation paths, and adapts its policy at runtime to keep communication and computation costs balanced under varying network and device conditions. Experimental results across diverse LLMs, datasets, and edge devices show that the approach reduces time-to-first-token latency by 1.3–5.1× and per-request energy consumption by 1.5–3.3×, with negligible degradation in output quality.
📝 Abstract
Efficient inference for on-device Large Language Models (LLMs) remains challenging due to limited hardware resources and the high cost of the prefill stage, which processes the full input context to construct Key-Value (KV) caches. We present SparKV, an adaptive KV loading framework that combines cloud-based KV streaming with on-device computation. SparKV models the cost of individual KV chunks and decides whether each chunk should be streamed or computed locally, while overlapping the two execution paths to reduce latency. To handle fluctuations in wireless connectivity and edge resource availability, SparKV further refines offline-generated schedules at runtime to rebalance communication and computation costs. Experiments across diverse datasets, LLMs, and edge devices show that SparKV reduces Time-to-First-Token by 1.3x to 5.1x with negligible impact on response quality, while lowering per-request energy consumption by 1.5x to 3.3x, demonstrating its robustness and practicality for real-world on-device deployment.
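To make the per-chunk stream-or-compute decision concrete, here is a minimal sketch of a greedy scheduler over two overlapping paths. It is an illustration only and not SparKV's actual algorithm: the `Chunk` type, the `schedule` function, and the cost values are all hypothetical, and the sketch assumes that per-chunk costs are known in advance and that chunks can be processed independently on either path.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    index: int
    stream_ms: float   # assumed cost to stream this chunk's KV from the cloud
    compute_ms: float  # assumed cost to compute this chunk's KV on the device


def schedule(chunks):
    """Greedily place each chunk on the path that would finish it sooner.

    The two paths run in parallel, so the estimated prefill time is the
    finish time of the slower path.
    """
    stream_busy = 0.0   # time at which the streaming path becomes free
    compute_busy = 0.0  # time at which the compute path becomes free
    plan = {}
    for c in chunks:
        finish_stream = stream_busy + c.stream_ms
        finish_compute = compute_busy + c.compute_ms
        if finish_stream <= finish_compute:
            plan[c.index] = "stream"
            stream_busy = finish_stream
        else:
            plan[c.index] = "compute"
            compute_busy = finish_compute
    return plan, max(stream_busy, compute_busy)


# Hypothetical costs: streaming a chunk takes 40 ms, computing it takes 30 ms.
chunks = [Chunk(i, stream_ms=40.0, compute_ms=30.0) for i in range(4)]
plan, prefill_ms = schedule(chunks)
print(plan, prefill_ms)
```

With these made-up costs, the chunks alternate between the two paths and the estimated prefill time is 80 ms, compared with 160 ms for streaming everything and 120 ms for computing everything. A runtime refinement step of the kind the abstract describes could be approximated by re-running such a scheduler on the remaining chunks whenever the measured costs drift from the estimates.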