🤖 AI Summary
In wireless edge computing networks, large language model (LLM) inference faces a fundamental trade-off between output quality and latency: offloading simple queries incurs high communication latency, while local lightweight models lack the capability for complex tasks. To address this, we propose a dynamic inference routing framework that jointly orchestrates on-device lightweight models and edge-resident powerful LLMs. Our method introduces a novel dual-cost model, designed for both single-turn and multi-turn dialogues, that accounts for semantic complexity (predicted by BERT), communication overhead, KV-cache management, and context-aware response quality assessment. This enables fine-grained, adaptive routing decisions at each inference step. Evaluated on MMLU, GSM8K, and MT-Bench-101, our framework achieves a 5-15% average latency reduction and 10-20% fewer LLM invocations compared to state-of-the-art baselines, while maintaining high-quality outputs and low-latency responsiveness.
📝 Abstract
The integration of wireless communications and Large Language Models (LLMs) is poised to unlock ubiquitous intelligent services, yet deploying LLMs in wireless edge-device collaborative environments presents a critical trade-off between inference quality and end-to-end latency. A fundamental mismatch exists between task complexity and resource allocation: offloading simple queries incurs prohibitive latency, while on-device models lack the capacity for demanding computations. To address this challenge, we propose a dynamic, quality-latency-aware routing framework that orchestrates inference between a lightweight model on the mobile device and a powerful model on the edge server. Our framework employs two distinct cost models: for single-turn queries, it fuses a BERT-predicted semantic score with communication and computation overheads; for multi-turn dialogues, it further quantifies context-aware costs arising from model switching and KV-cache management. Extensive experiments on the MMLU, GSM8K, and MT-Bench-101 benchmarks demonstrate that our framework cuts average response latency by 5-15% and reduces large-model invocations by 10-20% against competitive baselines, while maintaining full inference quality.
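The single-turn routing rule described above can be sketched as a simple cost comparison. Note this is an illustrative outline, not the paper's actual implementation: the function name `route_query`, the linear quality-penalty term, and all latency and weight values are assumptions; the paper's cost models additionally handle multi-turn context, model switching, and KV-cache effects.

```python
# Hypothetical sketch of a quality-latency-aware routing rule.
# All parameter names and weights here are illustrative placeholders.

def route_query(complexity: float,
                uplink_latency: float,
                edge_compute_latency: float,
                local_compute_latency: float,
                quality_weight: float = 1.0) -> str:
    """Return 'local' or 'edge' by comparing per-query costs.

    complexity: predicted semantic complexity in [0, 1] (e.g. from a
    BERT-based scorer); higher values mean the on-device lightweight
    model is more likely to produce a low-quality answer.
    """
    # Local cost: fast, but pays a quality penalty that grows with complexity.
    local_cost = local_compute_latency + quality_weight * complexity
    # Edge cost: high-quality output, but pays communication + server compute.
    edge_cost = uplink_latency + edge_compute_latency
    return "local" if local_cost <= edge_cost else "edge"

# A simple query stays on-device; a complex one is offloaded to the edge LLM.
print(route_query(complexity=0.1, uplink_latency=0.4,
                  edge_compute_latency=0.3, local_compute_latency=0.1))  # local
print(route_query(complexity=0.9, uplink_latency=0.4,
                  edge_compute_latency=0.3, local_compute_latency=0.1))  # edge
```

In this toy setting, raising `quality_weight` biases the router toward the edge model, which is one way a quality-versus-latency preference could be exposed as a single tunable knob.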