🤖 AI Summary
Deploying large language models on resource-constrained edge devices is bottlenecked by limited memory and compute capacity. This survey examines the core challenges of edge LLM inference and synthesizes the main technical approaches, including model compression (such as quantization), hardware-aware co-design, edge computing architectures, and dynamic resource scheduling, to give an overall picture of the field. By tracing how existing methods have evolved, it offers practical guidance for achieving efficient, low-latency LLM inference at the edge and identifies promising directions for future research and real-world applications.
📝 Abstract
Large language models (LLMs) have advanced rapidly, emerging as versatile tools across fields thanks to their exceptional language understanding, generation, and reasoning capabilities. However, performing LLM inference at the network edge remains challenging because of the models' large memory and compute demands. This survey outlines the challenges specific to LLM edge inference and provides a comprehensive overview of recent progress, covering system architectures, model optimization and deployment, and resource management and scheduling. By synthesizing state-of-the-art techniques and mapping future directions, this survey aims to unlock the potential of LLMs in resource-constrained edge environments.