DILEMMA: Joint LLM Quantization and Distributed LLM Inference Over Edge Computing Systems

📅 2025-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of deploying large language models (LLMs) with low latency and high performance in resource-constrained edge computing environments, this paper proposes a joint optimization framework for model layer placement and layer-wise quantization. It formulates the co-design of layer-wise quantization and distributed inference as an integer linear programming (ILP) problem—the first such formulation—and integrates layer-wise quantization with knowledge distillation to enable accuracy-controllable model compression. Evaluated on the OPT-350 model, the approach achieves a quantization ratio of up to 12.75% with no increase in model loss on the SQuAD benchmark and significantly reduces end-to-end inference latency. The key contributions are: (1) an ILP-driven unified optimization framework that jointly optimizes quantization and deployment; and (2) a co-designed strategy for layer-wise quantization and distributed scheduling tailored to heterogeneous edge resources.

📝 Abstract
With the recent trend of using Large Language Models (LLMs) for different applications within smart cities, there is a need to push these models toward the edge of the network while still preserving their performance. Edge Computing (EC), as a computing resource physically closer to end users, can help reduce the communication delay of serving end users' tasks for LLM-dependent services. However, EC servers have limited communication, computation, and storage capacity. This paper introduces DILEMMA, a novel framework that addresses the challenges of deploying LLMs in EC systems by jointly optimizing layer placement and layer-wise quantization. DILEMMA formulates an Integer Linear Programming problem to minimize total inference delay while ensuring acceptable LLM performance, leveraging layer-wise quantization and knowledge distillation for LLM performance control. Experimental evaluations on the OPT-350 model using the SQuAD dataset demonstrate that DILEMMA achieves a quantization ratio of up to 12.75% while preserving model loss, highlighting its effectiveness in resource-constrained environments.
Problem

Research questions and friction points this paper is trying to address.

Deploying LLMs efficiently in resource-constrained edge computing systems
Minimizing inference delay while maintaining acceptable model performance
Achieving a high quantization ratio without sacrificing model accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint optimization of layer placement and quantization
Integer Linear Programming for minimizing inference delay
Layer-wise quantization and knowledge distillation techniques
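The joint optimization the Innovation bullets describe can be illustrated with a toy example. The sketch below brute-forces what DILEMMA formulates as an ILP: for each layer, choose both an edge node (placement) and a bit-width (quantization) so that total inference delay is minimized, node memory capacities are respected, and a cumulative accuracy-loss penalty stays within a budget. All numbers, node names, the delay model, and the penalty model are hypothetical placeholders, not values from the paper; a real solver (e.g., an ILP backend) would replace the exhaustive search.

```python
# Illustrative brute-force sketch of DILEMMA-style joint optimization.
# HYPOTHETICAL data: layer costs, node capacities, link delay, and the
# accuracy-penalty model are all invented for illustration.
from itertools import product

LAYERS = 3                       # tiny 3-layer "model"
BITS = [16, 8, 4]                # candidate per-layer bit-widths
MEM16 = [40, 40, 40]             # per-layer memory (MB) at 16-bit
NODES = {"edge_a": {"cap": 80, "speed": 1.0},    # fast, medium capacity
         "edge_b": {"cap": 60, "speed": 1.5}}    # slower, smaller node
LINK_MS = 5.0                    # hop delay when consecutive layers move nodes
PENALTY = {16: 0.0, 8: 0.1, 4: 0.4}  # accuracy-loss penalty per bit-width
BUDGET = 0.6                     # total penalty allowed

def solve():
    """Enumerate all (placement, bit-width) assignments; keep the feasible
    one with minimum end-to-end delay."""
    best = None
    node_names = list(NODES)
    for placement in product(node_names, repeat=LAYERS):
        for bits in product(BITS, repeat=LAYERS):
            # Constraint 1: total accuracy-loss penalty within budget.
            if sum(PENALTY[b] for b in bits) > BUDGET:
                continue
            # Constraint 2: quantized layers fit each node's memory.
            if any(sum(MEM16[l] * bits[l] / 16
                       for l in range(LAYERS) if placement[l] == n)
                   > NODES[n]["cap"] for n in node_names):
                continue
            # Objective: compute delay (scaled by node speed) plus one
            # link delay per inter-node hop between consecutive layers.
            delay = sum(10.0 * NODES[placement[l]]["speed"]
                        for l in range(LAYERS))
            delay += sum(LINK_MS for l in range(1, LAYERS)
                         if placement[l] != placement[l - 1])
            if best is None or delay < best[0]:
                best = (delay, placement, bits)
    return best

best = solve()
print(best)  # (delay_ms, placement per layer, bit-width per layer)
```

In this toy instance the full-precision model does not fit on either node alone, so the search is forced to trade quantization against placement, which is exactly the coupling the paper's ILP captures.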