CR^2: Cost-Aware Risk-Controlled Routing for Wireless Device-Edge LLM Inference

📅 2026-05-12

🤖 AI Summary
This work addresses the challenge of balancing latency, energy consumption, and accuracy in large language model inference within resource-constrained, collaborative device-edge environments, where existing routing approaches struggle to model dynamic deployment costs. The authors propose CR^2, a framework that formulates query-level routing as a constrained, cost-sensitive decision problem through a two-stage architecture comprising a lightweight on-device margin gate and an edge-side utility selector. CR^2 combines a margin gate conditioned on a user-specified cost weight with a conformal risk control (CRC) calibration procedure that explicitly bounds the risk of false acceptance, and it relies solely on device-side signals to closely match a full-information reference router. Experiments show that, at matched accuracy, CR^2 reduces normalized deployment cost by up to 16.9% compared with strong baselines and consistently improves the accuracy-cost Pareto frontier.
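To make the cost-weighted routing decision concrete, here is a minimal sketch of how a "local versus best edge alternative" margin could be computed. The linear utility (accuracy minus cost weight times cost), the `Option` class, and all numbers are illustrative assumptions, not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """Hypothetical execution option; fields are illustrative assumptions."""
    name: str
    accuracy: float  # expected correctness in [0, 1]
    cost: float      # normalized deployment cost (latency + energy) in [0, 1]


def utility(opt: Option, cost_weight: float) -> float:
    # Assumed linear trade-off: higher cost_weight penalizes expensive options more.
    return opt.accuracy - cost_weight * opt.cost


def local_margin(local: Option, edge_options: list[Option], cost_weight: float) -> float:
    # Positive margin means local execution is at least as good as the best edge option.
    best_edge = max(utility(e, cost_weight) for e in edge_options)
    return utility(local, cost_weight) - best_edge


local = Option("on-device", accuracy=0.60, cost=0.10)
edges = [Option("edge-small", 0.75, 0.40), Option("edge-large", 0.90, 0.80)]

low = local_margin(local, edges, cost_weight=0.1)   # accuracy dominates: defer to edge
high = local_margin(local, edges, cost_weight=1.0)  # cost dominates: stay local
print(f"margin at weight 0.1: {low:+.2f}, at weight 1.0: {high:+.2f}")
```

Under these assumed numbers the margin changes sign as the cost weight grows, which is the behavior a gate conditioned on a user-specified cost weight would need to predict from device-side signals alone.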
📝 Abstract
As large language models (LLMs) move from centralized clouds to mobile edge environments, efficient serving must balance latency, energy consumption, and accuracy under constrained device-edge resources. Query-level routing between lightweight on-device models and stronger edge models provides a flexible mechanism to navigate this trade-off. However, existing routers are designed for centralized cloud settings and optimize token-level costs, failing to capture the dynamic latency and energy overheads in wireless edge deployments. In this paper, we formulate mobile edge LLM routing as a deployment-constrained, cost-aware decision problem, and propose CR^2, a two-stage device-edge routing framework. CR^2 decouples a lightweight on-device margin gate from an edge-side utility selector for deferred queries. The margin gate operates on frozen query embeddings and a user-specified cost weight to predict whether local execution is utility-optimal relative to the best edge alternative under the target operating point. We further introduce a conformal risk control (CRC) calibration procedure that maps each operating point to an acceptance threshold, enabling explicit control of the marginal false-acceptance risk under the full-information utility reference. Experiments on the routing task show that CR^2 closely matches a full-information reference router using only device-side signals before deferral. Compared with strong query-level baselines, CR^2 consistently improves the deployable accuracy-cost Pareto frontier and reduces normalized deployment cost by up to 16.9% at matched accuracy.
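The CRC calibration step can be illustrated with a small sketch based on the standard conformal risk control recipe for a bounded, monotone loss. It assumes the gate outputs a score per query, that a held-out calibration set carries full-information labels saying whether local execution was utility-optimal, and that a false acceptance means accepting a query locally when it was not. The function name `crc_threshold` and the data are hypothetical; the paper's exact procedure may differ.

```python
def crc_threshold(scores, local_is_optimal, alpha):
    """Pick the smallest observed-score threshold tau such that accepting
    queries with score >= tau satisfies the CRC bound
        (false_accepts + 1) / (n + 1) <= alpha,
    where the loss per query is 1 if it is accepted locally but local
    execution was not utility-optimal, and 0 otherwise (so B = 1)."""
    n = len(scores)
    # Search over observed scores only; raising tau can only reduce false accepts.
    for tau in sorted(set(scores)):
        false_accepts = sum(
            1 for s, ok in zip(scores, local_is_optimal) if s >= tau and not ok
        )
        if (false_accepts + 1) / (n + 1) <= alpha:
            return tau
    # No observed score meets the target: accept nothing locally.
    return float("inf")


# Hypothetical calibration set: gate scores and full-information labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]
labels = [True, True, True, False, True, False, True, False, False]

for alpha in (0.2, 0.1, 0.05):
    print(alpha, crc_threshold(scores, labels, alpha))
```

In this sketch each target risk level `alpha` maps to one acceptance threshold, so a stricter target defers more queries to the edge. With only nine calibration points the tightest target cannot be met by any finite threshold, and the function falls back to deferring everything.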
Problem

Research questions and friction points this paper is trying to address.

LLM inference
edge computing
cost-aware routing
risk control
device-edge collaboration
Innovation

Methods, ideas, or system contributions that make the work stand out.

cost-aware routing
conformal risk control
device-edge LLM inference
margin gate
Pareto frontier optimization