ConsRoute: Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models

📅 2026-03-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of high latency, high cost, and resource constraints in deploying large language models (LLMs) across cloud-edge-device collaborative environments. The authors propose ConsRoute, a novel framework that introduces fine-grained semantic consistency as a routing supervision signal for the first time. By reusing hidden states from the LLM’s prefilling phase to construct lightweight query representations, ConsRoute dynamically learns adaptive routing thresholds through clustering and Bayesian optimization, jointly optimizing response quality, latency, and cost. Notably, the method requires no additional encoder and achieves over 95% of cloud-level performance while reducing end-to-end latency and inference cost by nearly 40%, significantly outperforming existing routing strategies.
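To make the routing idea concrete, here is a minimal sketch of a cluster-specific threshold decision. It is not the authors' implementation: the function names, the two-way "local"/"escalate" outcome, the toy 2-D vectors standing in for prefill hidden states, and the threshold values are all assumptions chosen for illustration.

```python
import math

def nearest_cluster(vec, centroids):
    """Return the index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))

def route(vec, predicted_consistency, centroids, thresholds):
    """Decide where a query is served (hypothetical two-way decision).

    The query stays on the smaller model when its predicted consistency
    with the larger model's answer clears the threshold of its cluster;
    otherwise it is escalated to the next tier.
    """
    cluster = nearest_cluster(vec, centroids)
    return "local" if predicted_consistency >= thresholds[cluster] else "escalate"

# Toy values (assumed): two clusters with different learned thresholds.
centroids = [[0.0, 0.0], [10.0, 10.0]]
thresholds = [0.6, 0.9]

# Same predicted consistency, different clusters, different decisions.
decision_a = route([0.5, -0.2], 0.7, centroids, thresholds)
decision_b = route([9.0, 11.0], 0.7, centroids, thresholds)
print(decision_a, decision_b)
```

The point of the per-cluster thresholds is visible in the last two calls: an identical consistency estimate can be good enough for one kind of query and not for another.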

📝 Abstract
Large language models (LLMs) deliver impressive capabilities but incur substantial inference latency and cost, which hinders their deployment in latency-sensitive and resource-constrained scenarios. Cloud-edge-device collaborative inference has emerged as a promising paradigm by dynamically routing queries to models of different capacities across tiers. In this paper, we propose ConsRoute, a lightweight, semantic-aware, and adaptive routing framework that significantly improves inference efficiency while minimizing impact on response quality. Unlike prior routing methods that rely on predicting coarse-grained output quality gaps, ConsRoute leverages a reranker to directly assess the semantic consistency between responses generated by models at different tiers, yielding fine-grained soft supervision signals for routing. To minimize device-side overhead, ConsRoute reuses hidden states from the LLM prefilling stage as compact query representations, avoiding additional encoders or inference passes. Furthermore, these representations are clustered, and Bayesian optimization is employed to learn cluster-specific routing thresholds that dynamically balance quality, latency, and cost under heterogeneous query distributions. Extensive experiments demonstrate that ConsRoute achieves near-cloud performance (>=95%) while reducing end-to-end latency and inference cost by nearly 40%, consistently outperforming existing routing baselines in both response quality and system efficiency.
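The threshold-learning step can be illustrated with a small sketch. The abstract states that Bayesian optimization is used; the code below swaps in a plain grid search to stay dependency-free, and the utility function (escalated queries earn cloud quality 1.0 minus a cost penalty) is an assumed stand-in for the paper's quality, latency, and cost objective. All names and sample values are hypothetical.

```python
def objective(threshold, samples, cost_weight):
    """Mean utility of one cluster under a given threshold (assumed form).

    Each sample is (predicted_consistency, actual_consistency).
    A query served locally earns its actual consistency score; an
    escalated query earns 1.0 (cloud quality) minus a cost penalty.
    """
    total = 0.0
    for predicted, actual in samples:
        if predicted >= threshold:
            total += actual
        else:
            total += 1.0 - cost_weight
    return total / len(samples)

def tune_threshold(samples, cost_weight, grid_steps=20):
    """Grid search over [0, 1]; a simple substitute for Bayesian optimization."""
    candidates = [i / grid_steps for i in range(grid_steps + 1)]
    return max(candidates, key=lambda t: objective(t, samples, cost_weight))

# Toy calibration data for one cluster (assumed values).
cluster_samples = [(0.9, 0.95), (0.8, 0.9), (0.3, 0.2), (0.2, 0.1)]
best = tune_threshold(cluster_samples, cost_weight=0.4)
print(best, objective(best, cluster_samples, 0.4))
```

Running the same search once per cluster yields the cluster-specific thresholds; a Bayesian optimizer would explore the same objective with fewer evaluations when the search space has more than one dimension.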
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Cloud-Edge-Device Collaboration
Query Routing
Inference Latency
Resource Constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic consistency
adaptive query routing
cloud-edge-device collaboration
Bayesian optimization
hidden state reuse
Haoyu Qiao
Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China and also with the National Key Laboratory of Smart Farm Technologies and Systems, Harbin 150001, China
Hao Zhang
Harbin Institute of Technology
machine learning, deep learning, federated learning
Shanwen Mao
Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China and also with the National Key Laboratory of Smart Farm Technologies and Systems, Harbin 150001, China
Siyao Cheng
Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China and also with the National Key Laboratory of Smart Farm Technologies and Systems, Harbin 150001, China
Jie Liu
Harbin Institute of Technology
Computer Science and Engineering