Uncertainty-Aware Hybrid Inference with On-Device Small and Remote Large Language Models

📅 2024-12-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses low token throughput and high computational overhead in hybrid language model (HLM) architectures, caused by uplink transmission and edge-cloud collaborative inference. We propose an uncertainty-aware edge-cloud inference framework. Its core innovation is the first identification of a linear relationship between the small model’s output entropy and the large model’s rejection probability, enabling a risk-controllable adaptive skipping mechanism: the small model dynamically estimates per-token uncertainty and uploads only tokens exceeding a threshold to the base station’s large model for verification. Integrating uncertainty quantification, speculative inference, and channel-aware scheduling, our method achieves Pareto-optimal trade-offs between accuracy and efficiency—maintaining 97.54% of the large model’s accuracy while reducing uplink transmission and large-model computation overhead by 45.93%, and boosting token throughput to 2.54× that of the baseline HLM.

📝 Abstract
This paper studies a hybrid language model (HLM) architecture that integrates a small language model (SLM) operating on a mobile device with a large language model (LLM) hosted at the base station (BS) of a wireless network. The HLM token generation process follows the speculative inference principle: the SLM's vocabulary distribution is uploaded to the LLM, which either accepts or rejects it, with rejected tokens being resampled by the LLM. While this approach ensures alignment between the vocabulary distributions of the SLM and LLM, it suffers from low token throughput due to uplink transmission and the computation costs of running both language models. To address this, we propose a novel HLM structure coined Uncertainty-aware opportunistic HLM (U-HLM), wherein the SLM locally measures its output uncertainty and skips both uplink transmissions and LLM operations for tokens that are likely to be accepted. This opportunistic skipping is enabled by our empirical finding of a linear correlation between the SLM's uncertainty and the LLM's rejection probability. We analytically derive the uncertainty threshold and evaluate its expected risk of rejection. Simulations show that U-HLM reduces uplink transmissions and LLM computations by 45.93%, while achieving up to 97.54% of the LLM's inference accuracy and 2.54× faster token throughput than HLM without skipping.
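The skipping decision described above can be sketched in a few lines: the SLM computes the entropy of its per-token vocabulary distribution and uploads the token for LLM verification only when that entropy exceeds a threshold. This is a minimal illustrative sketch, not the paper's implementation; the function names and the threshold value here are hypothetical (the paper derives the threshold analytically from the linear uncertainty-rejection relationship).

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of the SLM's vocabulary
    distribution for a single generated token."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_upload(probs, threshold):
    """Upload the token to the BS-hosted LLM for verification only
    when the SLM's uncertainty exceeds the threshold; otherwise
    skip both the uplink transmission and the LLM computation."""
    return token_entropy(probs) > threshold

# Hypothetical threshold for illustration only.
THRESHOLD = 0.5

# Peaked distribution: SLM is confident, so the token is likely
# to be accepted and uplink/LLM work can be skipped.
confident = [0.97, 0.01, 0.01, 0.01]

# Flat distribution: SLM is uncertain, so the token is uploaded
# for speculative verification by the LLM.
uncertain = [0.25, 0.25, 0.25, 0.25]
```

Under this sketch, `should_upload(confident, THRESHOLD)` is `False` while `should_upload(uncertain, THRESHOLD)` is `True`, reflecting the risk-controllable trade-off between accuracy and uplink/computation cost.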
Problem

Research questions and friction points this paper is trying to address.

Low token throughput in hybrid architectures that combine small and large language models
High uplink transmission and computation costs of running both models for every token
Preserving the large model's inference accuracy while improving efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid architecture integrates an on-device SLM with a remote LLM.
Uncertainty-aware skipping reduces uplink and computation costs.
Linear correlation between SLM uncertainty and LLM rejection.