AI Summary
Deploying large language models (LLMs) on edge devices faces fundamental challenges, including constrained computational resources, limited memory capacity, and hardware heterogeneity. To address these, this paper systematically surveys the full lifecycle of edge LLMs and introduces a comprehensive, stack-wide technical taxonomy spanning lightweight model design (e.g., pruning, quantization, distillation), runtime optimization (e.g., memory-aware inference, device-adaptive scheduling), and on-device deployment (e.g., cloud-edge collaborative frameworks). It proposes a cross-platform co-deployment paradigm that unifies pre-deployment model compression with dynamic execution optimization. Based on a synthesis of over 120 state-of-the-art studies, the work identifies five persistent technical bottlenecks and six key future research directions. The resulting methodology provides both a reusable conceptual framework and practical guidelines for deploying AI at the edge.
Abstract
Large language models (LLMs) have revolutionized natural language processing with their exceptional capabilities. However, deploying LLMs on resource-constrained edge devices presents significant challenges due to computational limitations, memory constraints, and edge hardware heterogeneity. This survey summarizes recent developments in edge LLMs across their lifecycle, examining resource-efficient designs from pre-deployment techniques to runtime optimizations. Additionally, it explores on-device LLM applications in personal, enterprise, and industrial scenarios. By synthesizing advancements and identifying future directions, this survey aims to provide a comprehensive understanding of state-of-the-art methods for deploying LLMs on edge devices, bridging the gap between their immense potential and edge computing limitations.
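As an illustration of the pre-deployment compression techniques surveyed (pruning, quantization, distillation), the sketch below shows symmetric per-tensor int8 weight quantization, one of the simplest schemes in this family. This is a minimal, generic example for intuition, not a method proposed by the survey itself; real edge deployments typically use finer-grained (per-channel or group-wise) schemes.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    # Map the largest absolute weight to the int8 limit 127.
    scale = float(np.max(np.abs(weights))) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 weight tensor."""
    return q.astype(np.float32) * scale

# Quantize a small random weight matrix and measure the round-trip error.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
max_err = float(np.max(np.abs(w - w_hat)))
```

Storing `q` (1 byte per weight) plus one scale cuts memory roughly 4x versus float32, at the cost of a bounded rounding error of at most half the scale per weight.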