🤖 AI Summary
In wireless edge computing networks, large language model (LLM) inference faces a fundamental trade-off between output quality and latency: offloading simple queries incurs high communication latency, while local lightweight models lack the capability for complex tasks. To address this, we propose a dynamic inference routing framework that jointly orchestrates on-device lightweight models and edge-resident powerful LLMs. Our method introduces a novel dual-cost model, designed for both single-turn and multi-turn dialogues, that accounts for semantic complexity (predicted by BERT), communication overhead, KV-cache management, and context-aware response quality assessment. This enables fine-grained, adaptive routing decisions at each inference step. Evaluated on MMLU, GSM8K, and MT-Bench-101, our framework achieves a 5-15% average latency reduction and 10-20% fewer LLM invocations compared to state-of-the-art baselines, while maintaining high-quality outputs and low-latency responsiveness.
📝 Abstract
The integration of wireless communications and Large Language Models (LLMs) is poised to unlock ubiquitous intelligent services, yet deploying LLMs in wireless edge-device collaborative environments presents a critical trade-off between inference quality and end-to-end latency. A fundamental mismatch exists between task complexity and resource allocation: offloading simple queries incurs prohibitive latency, while on-device models lack the capacity for demanding computations. To address this challenge, we propose a dynamic, quality-latency-aware routing framework that orchestrates inference between a lightweight model on the mobile device and a powerful model on the edge server. Our framework employs two distinct cost models: for single-turn queries, it fuses a BERT-predicted semantic score with communication and computation overheads; for multi-turn dialogues, it further quantifies context-aware costs arising from model switching and KV-cache management. Extensive experiments on the MMLU, GSM8K, and MT-Bench-101 benchmarks demonstrate that our framework cuts average response latency by 5-15% and reduces large-model invocations by 10-20% against competitive baselines, while maintaining full inference quality.
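The single-turn routing rule described above can be sketched as a simple cost comparison. Note this is an illustrative outline, not the paper's actual implementation: the function name `route_query`, the linear quality-penalty term, and all latency and weight values are assumptions; the paper's cost models additionally handle multi-turn context, model switching, and KV-cache effects.

```python
# Hypothetical sketch of a quality-latency-aware routing rule.
# All parameter names and weights here are illustrative placeholders.

def route_query(complexity: float,
                uplink_latency: float,
                edge_compute_latency: float,
                local_compute_latency: float,
                quality_weight: float = 1.0) -> str:
    """Return 'local' or 'edge' by comparing per-query costs.

    complexity: predicted semantic complexity in [0, 1] (e.g. from a
    BERT-based scorer); higher values mean the on-device lightweight
    model is more likely to produce a low-quality answer.
    """
    # Local cost: fast, but pays a quality penalty that grows with complexity.
    local_cost = local_compute_latency + quality_weight * complexity
    # Edge cost: high-quality output, but pays communication + server compute.
    edge_cost = uplink_latency + edge_compute_latency
    return "local" if local_cost <= edge_cost else "edge"

# A simple query stays on-device; a complex one is offloaded to the edge LLM.
print(route_query(complexity=0.1, uplink_latency=0.4,
                  edge_compute_latency=0.3, local_compute_latency=0.1))  # local
print(route_query(complexity=0.9, uplink_latency=0.4,
                  edge_compute_latency=0.3, local_compute_latency=0.1))  # edge
```

In this toy setting, raising `quality_weight` biases the router toward the edge model, which is one way a quality-versus-latency preference could be exposed as a single tunable knob.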