Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities

📅 2026-04-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Deploying large language models on resource-constrained edge devices is bottlenecked by limited memory and compute capacity. This survey examines the core challenges of LLM inference at the network edge and synthesizes the key technical approaches, including model compression, quantization, hardware-aware co-design, edge computing architectures, and dynamic resource scheduling. By tracing how existing methods have evolved, it provides a coherent foundation and practical guidance for efficient, low-latency LLM inference at the edge, and it identifies promising directions for future research and real-world applications.

📝 Abstract

Large language models (LLMs) have advanced rapidly, emerging as versatile tools across fields thanks to their exceptional language understanding, generation, and reasoning capabilities. However, performing LLM inference at the network edge remains challenging because of the models' large memory and compute demands. This survey outlines the challenges specific to LLM edge inference and provides a comprehensive overview of recent progress, covering system architectures, model optimization and deployment, and resource management and scheduling. By synthesizing state-of-the-art techniques and mapping future directions, this survey aims to unlock the potential of LLMs in resource-constrained edge environments.

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Edge Inference

Resource Constraints

Memory Demand

Compute Demand

Innovation

Methods, ideas, or system contributions that make the work stand out.

Edge Inference

Large Language Models

Model Optimization