AI Summary
This work addresses the trade-offs among latency, energy consumption, and generation quality in deploying large language models (LLMs) within mobile edge computing environments. To this end, the authors propose a joint framework that integrates model compression with inference offloading optimization. A lightweight edge-adapted LLM is constructed through structured pruning, low-bit quantization, and knowledge distillation. Furthermore, they introduce, for the first time, a world model-enhanced proximal policy optimization (World Model-PPO) algorithm to enable efficient offloading decisions under dynamic network conditions. Experimental results demonstrate that the proposed approach reduces model size by 70-80%, cuts per-query energy consumption by 50%, and decreases inference latency by 12-30%, all while satisfying accuracy and low-hallucination constraints. Additionally, the World Model-PPO algorithm achieves a 50% improvement in convergence speed.
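As a rough illustration of how pruning and quantization compose into the reported 70-80% storage reduction, the sketch below decomposes the compressed size into a pruning ratio and a quantization bit-width. The specific pruning fraction and 4-bit setting are illustrative assumptions, not figures from the paper, which reports only the overall reduction.

```python
def compressed_size_gb(base_gb: float, prune_frac: float,
                       base_bits: int = 16, quant_bits: int = 4) -> float:
    """Estimate on-disk model size after structured pruning and quantization.

    base_gb:    storage of the uncompressed model (weights at base_bits each)
    prune_frac: fraction of weights removed by structured pruning (assumed)
    quant_bits: post-quantization bit-width (assumed; not stated in the paper)
    """
    return base_gb * (1.0 - prune_frac) * quant_bits / base_bits

# Llama-3.1-8B: 15.3 GB at 16-bit. An assumed ~14% structured-pruning ratio
# combined with 4-bit quantization lands near the reported 3.3 GB figure.
print(f"{compressed_size_gb(15.3, prune_frac=0.14):.2f} GB")  # → 3.29 GB
```

Knowledge distillation does not appear in this arithmetic because it recovers quality rather than shrinking storage further.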
Abstract
This paper investigates compact large language model (LLM) deployment and world-model-assisted inference offloading in mobile edge computing (MEC) networks. We first propose an edge compact LLM deployment (ECLD) framework that jointly applies structured pruning, low-bit quantization, and knowledge distillation to construct edge-deployable LLM variants, and we evaluate these models using four complementary metrics: accessibility, energy consumption, hallucination rate, and generalization accuracy. Building on the resulting compact models, we formulate an MEC offloading optimization problem that minimizes the long-term average inference latency subject to per-device energy budgets and LLM-specific quality-of-service constraints on effective accuracy and hallucination. To solve this problem under unknown and time-varying network dynamics, we develop a world model-proximal policy optimization (world model-PPO) algorithm, which augments an on-policy PPO algorithm with a learned recurrent world model that provides improved value targets and short imagination rollouts. Extensive experiments on Llama-3.1-8B, Qwen3-8B, and Mistral-12B show that ECLD compresses base models by about 70-80% in storage (e.g., from 15.3 GB to 3.3 GB for Llama-3.1-8B) and reduces per-query energy consumption by up to 50%, while largely preserving accuracy and often lowering hallucination compared with quantization-only or pruning-only baselines. The experiments also show that world model-PPO speeds up convergence by about 50%, improves the final reward by 15.8% over vanilla PPO, and reduces average inference latency by 12-30% across different user populations, while satisfying the accuracy and hallucination constraints and approaching the generation quality of always-offloading with much of the efficiency of local execution.
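To make the world-model-PPO idea concrete: a learned world model lets the agent roll its policy forward in imagination for a few steps and bootstrap the remainder with the critic, yielding a multi-step value target instead of PPO's usual one-step bootstrap. The sketch below uses a deterministic toy dynamics/reward model and scalar state; the paper's actual world model is a learned recurrent network, and all function names here are hypothetical.

```python
from typing import Callable

def imagined_value_target(
    state: float,
    policy: Callable[[float], float],           # action = policy(state)
    dynamics: Callable[[float, float], float],  # learned transition model (toy, deterministic)
    reward: Callable[[float, float], float],    # learned reward model
    value: Callable[[float], float],            # current critic estimate
    horizon: int = 3,
    gamma: float = 0.99,
) -> float:
    """k-step imagined return: roll the world model `horizon` steps forward
    from `state` under the current policy, then bootstrap with value().
    Such a target can replace the one-step bootstrapped target in PPO's
    value loss (a sketch of the world-model-augmented idea, not the
    paper's exact algorithm)."""
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)
        ret += discount * reward(state, action)
        state = dynamics(state, action)
        discount *= gamma
    return ret + discount * value(state)

# Toy offloading-style setup: state is a latency backlog that the
# (hypothetical) policy drains by offloading; reward is negative latency.
policy   = lambda s: 0.5 * s       # offload half the backlog
dynamics = lambda s, a: s - a      # assumed learned dynamics
reward   = lambda s, a: -s         # minimize latency
value    = lambda s: -2.0 * s      # assumed critic
print(imagined_value_target(4.0, policy, dynamics, reward, value, horizon=2))
```

With a short horizon the target stays cheap to compute while injecting model-based lookahead, which is consistent with the reported faster convergence relative to vanilla PPO.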