Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization

📅 2025-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying large language models (LLMs) on edge devices faces a trade-off: small language models (SLMs) offer low latency and power consumption but yield unreliable responses to complex queries, while existing uncertainty-based SLM→LLM routing strategies improve accuracy yet incur excessive invocation overhead and generalize poorly. Method: We propose an uncertainty-aware dynamic routing mechanism that first identifies the model's intrinsic uncertainty distribution as the dominant factor in routing performance, then designs a calibration data construction instruction pipeline, and releases a general-purpose hold-out dataset to enhance cross-task generalization. Our method requires only zero-shot calibration data, with no new annotations. Contribution/Results: Evaluated across 1,500+ experimental settings, our approach significantly improves both accuracy and cost-efficiency. Moreover, we establish the first comprehensive routing benchmark framework supporting multi-dimensional evaluation (e.g., accuracy, latency, energy, and routing overhead).

📝 Abstract
Large language models (LLMs) are increasingly deployed and democratized on edge devices. To improve the efficiency of on-device deployment, small language models (SLMs) are often adopted due to their low decoding latency and reduced energy consumption. However, these SLMs often generate inaccurate responses when handling complex queries. One promising solution is uncertainty-based SLM routing, which offloads high-stakes queries to stronger LLMs when the SLM produces low-confidence responses. This follows the principle of "If you lack confidence, seek stronger support" to enhance reliability. Relying on more powerful LLMs is effective, yet it increases invocation costs. Therefore, striking a routing balance between efficiency and efficacy remains a critical challenge. Additionally, efficiently generalizing the routing strategy to new datasets remains under-explored. In this paper, we conduct a comprehensive investigation into benchmarking and generalization of uncertainty-driven routing strategies from SLMs to LLMs across 1,500+ settings. Our findings highlight: First, uncertainty-correctness alignment in different uncertainty quantification (UQ) methods significantly impacts routing performance. Second, uncertainty distributions depend more on the specific SLM and the chosen UQ method than on the downstream data. Building on this insight, we propose a calibration data construction instruction pipeline and open-source a constructed hold-out set to enhance routing generalization in new downstream scenarios. The experimental results indicate that calibration data effectively bootstraps routing performance without requiring any new data.
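The core routing idea in the abstract can be sketched in a few lines: score the SLM's answer with an uncertainty measure (here, mean per-token entropy, one common UQ choice; the paper evaluates several) and escalate to the LLM only when the score crosses a calibrated threshold. All function names and the entropy-based UQ method below are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of uncertainty-based SLM-to-LLM routing.
# `slm_generate` / `llm_generate` are placeholder model interfaces.
import math

def mean_token_entropy(token_probs):
    """Average entropy of the SLM's per-token output distributions.

    token_probs: list of probability distributions (one per generated token).
    Higher values mean the SLM was less confident while decoding.
    """
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in token_probs
    ]
    return sum(entropies) / len(entropies)

def route(query, slm_generate, llm_generate, threshold):
    """Answer with the SLM; offload to the LLM when uncertainty is high.

    "If you lack confidence, seek stronger support": only queries whose
    SLM uncertainty exceeds `threshold` pay the LLM invocation cost.
    """
    answer, token_probs = slm_generate(query)
    if mean_token_entropy(token_probs) > threshold:
        return llm_generate(query)
    return answer
```

The threshold is the efficiency/efficacy knob the paper studies: too low and almost everything is offloaded (high invocation cost), too high and unreliable SLM answers slip through.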
Problem

Research questions and friction points this paper is trying to address.

Improve on-device LLM efficiency via SLM routing.
Balance routing costs between SLMs and stronger LLMs.
Enhance routing strategy generalization to new datasets.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uncertainty-based SLM routing
Calibration data construction pipeline
Benchmarking 1500+ settings
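Since the paper finds that uncertainty distributions depend mainly on the SLM and the UQ method rather than the downstream data, a routing threshold can be fit once on a general-purpose hold-out set and reused on new tasks. A minimal sketch of that calibration step, assuming simple (uncertainty, correctness) records and an illustrative accuracy-minus-cost objective (the field layout, `llm_accuracy`, and `cost_weight` are assumptions, not the paper's exact procedure):

```python
# Hypothetical threshold calibration from a hold-out set.
def calibrate_threshold(records, llm_accuracy, cost_weight=0.1):
    """Pick the routing threshold that best trades accuracy against cost.

    records: list of (uncertainty, slm_correct) pairs measured on the
    hold-out calibration set. Queries whose uncertainty exceeds the
    threshold are routed to the LLM, which is modeled as having a fixed
    expected accuracy `llm_accuracy` and unit invocation cost.
    """
    candidates = sorted({u for u, _ in records})
    best_t, best_score = None, float("-inf")
    for t in candidates:
        acc = cost = 0.0
        for u, correct in records:
            if u > t:           # offloaded to the LLM
                acc += llm_accuracy
                cost += 1.0
            elif correct:       # answered correctly by the SLM
                acc += 1.0
        n = len(records)
        score = acc / n - cost_weight * (cost / n)
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```

Because the calibration set is task-agnostic and annotated once, the same fitted threshold transfers to new downstream scenarios without collecting new labeled data, which is the generalization claim the paper makes.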