🤖 AI Summary
Hybrid Language Models (HLMs) suffer from excessive communication overhead, because the small language model (SLM) must upload its full vocabulary distribution for every token, and from wasted resources, because the large language model (LLM) verifies even high-confidence tokens that are almost certain to be accepted. To address this, we propose a communication-efficient HLM framework: the SLM generates candidate tokens locally and uploads only truncated vocabulary distributions for LLM verification, and only when its prediction uncertainty (measured by entropy or confidence) exceeds a theoretically derived optimal threshold. We establish, for the first time, a strong correlation between SLM uncertainty and LLM rejection probability, which makes this opportunistic, compression-aware transmission feasible. Our method integrates dynamic vocabulary truncation, theory-driven threshold optimization, and an end-to-cloud collaborative verification mechanism. Experiments show that, compared to the standard HLM, our approach achieves up to a 206× improvement in token throughput, skips 74.8% of transmissions, attains a 97.4% vocabulary compression rate, and maintains 97.4% accuracy.
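The gating idea above can be sketched in a few lines: compute the entropy of the SLM's token distribution, skip the upload when it is below a threshold, and otherwise send only a truncated, renormalized distribution. This is a minimal illustrative sketch; the specific threshold value and the top-p truncation rule here are assumptions for illustration, not the paper's derived optima.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a token distribution given as
    a dict mapping token -> probability."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def truncate_top_p(probs, p_keep=0.99):
    """Keep the smallest set of highest-probability tokens covering
    at least p_keep of the mass, then renormalize.
    (Hypothetical truncation rule, stands in for the paper's
    optimal vocabulary truncation.)"""
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = {}, 0.0
    for tok, p in items:
        kept[tok] = p
        mass += p
        if mass >= p_keep:
            break
    return {tok: p / mass for tok, p in kept.items()}

def maybe_upload(probs, threshold=1.0):
    """Opportunistic transmission: return None (accept the draft token
    locally, no upload) when SLM uncertainty is low; otherwise return
    a truncated distribution to send for LLM verification.
    The threshold value is illustrative, not the derived optimum."""
    if entropy(probs) < threshold:
        return None
    return truncate_top_p(probs)
```

For a confident prediction such as `{"the": 0.98, "a": 0.01, "an": 0.01}`, the entropy is near zero and `maybe_upload` skips transmission; for a near-uniform distribution, it returns a compact renormalized distribution for verification.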
📝 Abstract
To support emerging language-based applications using dispersed and heterogeneous computing resources, the hybrid language model (HLM) offers a promising architecture, where an on-device small language model (SLM) generates draft tokens that are validated and corrected by a remote large language model (LLM). However, the original HLM suffers from substantial communication overhead, as the LLM requires the SLM to upload the full vocabulary distribution for each token. Moreover, both communication and computation resources are wasted when the LLM validates tokens that are highly likely to be accepted. To overcome these limitations, we propose the communication-efficient and uncertainty-aware HLM (CU-HLM). In CU-HLM, the SLM transmits truncated vocabulary distributions only when its output uncertainty is high. We validate the feasibility of this opportunistic transmission by discovering a strong correlation between the SLM's uncertainty and the LLM's rejection probability. Furthermore, we theoretically derive optimal uncertainty thresholds and optimal vocabulary truncation strategies. Simulation results show that, compared to the standard HLM, CU-HLM achieves up to 206× higher token throughput by skipping 74.8% of transmissions with 97.4% vocabulary compression, while maintaining 97.4% accuracy.