Communication-Efficient Hybrid Language Model via Uncertainty-Aware Opportunistic and Compressed Transmission

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Hybrid Language Models (HLMs) suffer from excessive communication overhead and resource waste because large language models (LLMs) verify even the high-confidence tokens generated by small language models (SLMs). To address this, we propose a communication-efficient HLM framework: the SLM generates candidate tokens locally and uploads only truncated vocabulary distributions—and only when its prediction uncertainty (measured by entropy or confidence) exceeds a theoretically derived optimal threshold—for LLM verification. We establish, for the first time, a strong correlation between SLM uncertainty and LLM rejection probability, enabling opportunistic and compression-aware transmission. Our method integrates dynamic vocabulary truncation, theory-driven threshold optimization, and a device-cloud collaborative verification mechanism. Experiments show that, compared to standard HLMs, our approach achieves a 206× improvement in token throughput, skips 74.8% of transmissions, attains a 97.4% vocabulary compression rate, and maintains 97.4% accuracy.
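The gating idea described above can be sketched in a few lines: compute the entropy of the SLM's token distribution and upload it for LLM verification only when that uncertainty exceeds a threshold. This is a minimal illustration, not the paper's implementation; the fixed `threshold` below stands in for the theoretically derived optimum.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_upload(probs, threshold):
    """Opportunistic-transmission gate: send the distribution to the
    remote LLM verifier only when SLM uncertainty is high. The threshold
    here is a hypothetical stand-in for the paper's derived optimum."""
    return entropy(probs) > threshold

# A peaked (confident) distribution is accepted locally without any
# transmission; a flat (uncertain) one triggers an upload.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.30, 0.28, 0.22, 0.20]
print(should_upload(confident, threshold=0.5))  # False: skip upload
print(should_upload(uncertain, threshold=0.5))  # True: verify at LLM
```

Skipping the upload for confident tokens is what yields the reported 74.8% transmission-skip rate, since the paper shows low SLM uncertainty correlates with low LLM rejection probability.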

📝 Abstract
To support emerging language-based applications using dispersed and heterogeneous computing resources, the hybrid language model (HLM) offers a promising architecture, where an on-device small language model (SLM) generates draft tokens that are validated and corrected by a remote large language model (LLM). However, the original HLM suffers from substantial communication overhead, as the LLM requires the SLM to upload the full vocabulary distribution for each token. Moreover, both communication and computation resources are wasted when the LLM validates tokens that are highly likely to be accepted. To overcome these limitations, we propose communication-efficient and uncertainty-aware HLM (CU-HLM). In CU-HLM, the SLM transmits truncated vocabulary distributions only when its output uncertainty is high. We validate the feasibility of this opportunistic transmission by discovering a strong correlation between the SLM's uncertainty and the LLM's rejection probability. Furthermore, we theoretically derive optimal uncertainty thresholds and optimal vocabulary truncation strategies. Simulation results show that, compared to standard HLM, CU-HLM achieves up to 206× higher token throughput by skipping 74.8% of transmissions with 97.4% vocabulary compression, while maintaining 97.4% accuracy.
Problem

Research questions and friction points this paper is trying to address.

Reduces communication overhead in hybrid language models
Optimizes transmission via uncertainty-aware vocabulary truncation
Improves token throughput while maintaining high accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Opportunistic transmission based on uncertainty thresholds
Truncated vocabulary distributions for high uncertainty cases
Optimal vocabulary compression strategies for efficiency
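The truncation contribution listed above amounts to uploading only the top of the vocabulary distribution rather than all of it. The sketch below illustrates the idea under simple assumptions (a fixed top-k cutoff and renormalization); the paper instead derives the truncation strategy jointly with the uncertainty threshold, so `top_k` here is purely illustrative.

```python
def truncate_vocab(probs, top_k):
    """Keep only the top_k most probable tokens and renormalize.
    A minimal sketch of vocabulary truncation for upload; the paper's
    optimal truncation size is derived, not fixed as here."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

# Toy 5-token vocabulary; a real SLM vocabulary has tens of thousands
# of entries, which is where the reported 97.4% compression comes from.
full = {"the": 0.40, "a": 0.25, "an": 0.15, "this": 0.10, "that": 0.10}
truncated = truncate_vocab(full, top_k=2)
print(truncated)  # only "the" and "a" survive, renormalized to sum to 1
```

Because the distribution is uploaded only when uncertainty is high, and even then only its truncated head is sent, the two mechanisms compound: fewer transmissions, each carrying far less data.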
Seungeun Oh
School of Electrical and Electronic Engineering, Yonsei University, Korea
Jinhyuk Kim
School of Electrical and Electronic Engineering, Yonsei University, Korea
Jihong Park
Associate Professor, SUTD, SMIEEE
Wireless Communications · Semantic Communication · Distributed Machine Learning · AI-RAN
Seung-Woo Ko
Associate Professor, Inha University
V2X · Edge Intelligence · Localization · Semantic Communications
Jinho Choi
School of Electrical and Mechanical Engineering, The University of Adelaide, Australia
Tony Q. S. Quek
Information Systems Technology and Design pillar, Singapore University of Technology and Design, Singapore
Seong-Lyun Kim
School of EEE, Yonsei University
Wireless Systems · Radio Resource Management · Networked Robotics · AI for Wireless Systems