🤖 AI Summary
To address the elasticity bottleneck in deploying large language models (LLMs) on mobile devices, where different applications impose different latency service-level objectives (SLOs), this paper proposes ELMS, an on-device LLM service that scales elastically along two dimensions: the model and the prompt. Its key contributions are: (1) a one-time neuron reordering technique that exploits the permutation consistency inherent in Transformer models to produce high-quality elastic sub-models with minimal runtime switching cost; and (2) a dual-head compact language model that refines prompts and coordinates the elastic adaptation between the model and the prompt. Evaluations on several commercial off-the-shelf (COTS) smartphones show that ELMS surpasses four strong baselines by up to 16.83% and 11.04% in absolute accuracy on average, with less than 1% Time-To-First-Token (TTFT) switching overhead, memory usage comparable to the baselines, and fewer than 100 offline GPU hours.
📝 Abstract
On-device Large Language Models (LLMs) are revolutionizing mobile AI, enabling applications such as UI automation while addressing privacy concerns. Currently, the standard approach is to deploy a single, robust LLM as a universal solution for various applications, often referred to as LLM-as-a-Service (LLMaaS). However, this approach faces a significant system challenge: existing LLMs lack the flexibility to accommodate the diverse Service-Level Objectives (SLOs) on inference latency that different applications impose. To address this issue, we introduce ELMS, an on-device LLM service designed to provide elasticity in both the model and prompt dimensions of an LLMaaS. The system includes: (1) a one-time neuron reordering technique, which utilizes the inherent permutation consistency within transformer models to create high-quality, elastic sub-models with minimal runtime switching costs; and (2) a dual-head compact language model, which efficiently refines prompts and coordinates the elastic adaptation between the model and the prompt. We implement this elastic on-device LLM service on several commercial off-the-shelf (COTS) smartphones and evaluate ELMS using both standalone NLP/mobile-agent datasets and synthesized end-to-end traces. Across a range of SLOs, ELMS surpasses four strong baselines by up to 16.83% and 11.04% in absolute accuracy on average, with less than 1% Time-To-First-Token (TTFT) switching overhead, comparable memory usage, and fewer than 100 offline GPU hours.
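The permutation consistency behind the neuron reordering idea can be illustrated on a single feed-forward block. The following is a minimal sketch under stated assumptions, not the ELMS implementation: the toy dimensions, the `mlp` and `submodel` helpers, and the importance score (mean activation on a small calibration batch) are all hypothetical, since the abstract does not say how neurons are ranked. It shows that permuting hidden neurons consistently in both weight matrices leaves the output unchanged, so neurons can be sorted once offline and a sub-model can then be taken as a contiguous prefix slice.

```python
import numpy as np

def mlp(x, w_up, w_down):
    """Two-layer feed-forward block: relu(x @ w_up) @ w_down."""
    return np.maximum(x @ w_up, 0.0) @ w_down

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32                       # toy sizes (assumption)
x = rng.standard_normal((4, d_model))           # stand-in calibration batch
w_up = rng.standard_normal((d_model, d_hidden))
w_down = rng.standard_normal((d_hidden, d_model))

# Hypothetical importance score: mean post-ReLU activation per hidden neuron.
importance = np.maximum(x @ w_up, 0.0).mean(axis=0)
order = np.argsort(-importance)                 # most important neuron first

# One-time reordering: permute the columns of w_up and the rows of w_down
# with the same index vector.
w_up_sorted = w_up[:, order]
w_down_sorted = w_down[order, :]

# The reordered full model computes the same function as the original.
full_out = mlp(x, w_up, w_down)
outputs_match = np.allclose(full_out, mlp(x, w_up_sorted, w_down_sorted))

def submodel(x, keep):
    """Sub-model using only the first `keep` (most important) neurons."""
    return mlp(x, w_up_sorted[:, :keep], w_down_sorted[:keep, :])

half_out = submodel(x, d_hidden // 2)           # approximate, cheaper output
```

Because the sub-model weights are prefix slices of the reordered matrices, changing the sub-model size in this sketch only changes a slice bound rather than copying or rearranging weights, which is one plausible reading of why switching cost at runtime can stay small.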