🤖 AI Summary
In bandwidth-constrained settings, hybrid language models (HLMs) suffer from high token-level prediction uncertainty, leading to frequent large language model (LLM) offloading and excessive communication overhead. To address this, we propose a communication-efficient federated HLM framework. Our approach features: (1) federated learning–based collaborative optimization of token-level uncertainty thresholds to enable adaptive edge-cloud inference decisions; (2) a semantic similarity–driven peer-to-peer token reuse mechanism to minimize redundant transmissions; and (3) a hierarchical model aggregation strategy that balances convergence speed and privacy preservation. Evaluated on a large-scale news classification task, the framework reduces LLM-related communication volume by over 95%, with less than 0.3% accuracy degradation. It significantly improves communication efficiency, scalability, and practical deployability of edge AI systems.
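The core routing decision described above — accept the local SLM's token when its prediction is confident, offload to the LLM when uncertainty exceeds a learned threshold — can be sketched as follows. This is a minimal illustration, assuming entropy as the uncertainty measure; the function names and the threshold value are illustrative, not taken from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def route_token(probs, threshold):
    """Return 'slm' to accept the local prediction, 'llm' to offload.

    The per-client entropy threshold is the quantity the framework is
    said to optimize collaboratively via federated learning.
    """
    return "llm" if token_entropy(probs) > threshold else "slm"

# A confident distribution stays on-device; an ambiguous one is offloaded.
confident = [0.97, 0.01, 0.01, 0.01]
ambiguous = [0.30, 0.28, 0.22, 0.20]
print(route_token(confident, threshold=0.5))  # → slm
print(route_token(ambiguous, threshold=0.5))  # → llm
```

Learning this threshold per client (rather than fixing it globally) is what lets the framework adapt offloading frequency to each client's local data distribution.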
📝 Abstract
Hybrid Language Models (HLMs) combine the low-latency efficiency of Small Language Models (SLMs) on edge devices with the high accuracy of Large Language Models (LLMs) on centralized servers. Unlike traditional end-to-end LLM inference, HLMs reduce latency and communication by invoking LLMs only when local SLM predictions are uncertain, i.e., when token-level confidence is low or entropy is high. However, ambiguous or low-confidence predictions still require frequent offloading to the LLM, leading to significant communication overhead in bandwidth-constrained settings. To address this, we propose FedHLM, a communication-efficient HLM framework that integrates uncertainty-aware inference with Federated Learning (FL). FedHLM's key innovation lies in collaboratively learning token-level uncertainty thresholds that govern when LLM assistance is needed. Rather than using static or manually tuned thresholds, FedHLM employs FL to optimize these thresholds in a privacy-preserving, distributed manner. Additionally, it leverages embedding-based token representations for Peer-to-Peer (P2P) resolution, enabling clients to reuse tokens inferred by semantically similar peers without engaging the LLM. We further introduce hierarchical model aggregation: edge servers refine local routing policies through client updates, while cross-cluster coordination aligns global decision boundaries. This layered design captures recurring uncertainty patterns, reducing redundant LLM queries. Experiments on large-scale news classification tasks show that FedHLM reduces LLM transmissions by over 95% with negligible accuracy loss, making it well-suited for scalable and efficient edge AI applications.
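The P2P resolution step in the abstract — reusing a token already inferred by a semantically similar peer instead of querying the LLM — amounts to a nearest-neighbor lookup over peer token embeddings. The sketch below is an assumption-laden illustration: the cache structure, cosine-similarity matching, and the 0.9 similarity threshold are hypothetical choices, not details from the paper.

```python
def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def resolve_via_peers(query_emb, peer_cache, sim_threshold=0.9):
    """Return a peer's token if one is semantically close enough.

    peer_cache: list of (token, embedding) pairs shared by peers.
    Returns None when no peer match clears the threshold, i.e. the
    query must fall back to the LLM.
    """
    best_token, best_sim = None, sim_threshold
    for token, emb in peer_cache:
        sim = cosine_similarity(query_emb, emb)
        if sim >= best_sim:
            best_token, best_sim = token, sim
    return best_token

# Toy usage: a near-duplicate query is resolved from the peer cache.
peer_cache = [("economy", [0.9, 0.1, 0.0]), ("sports", [0.0, 0.2, 0.95])]
print(resolve_via_peers([0.88, 0.12, 0.05], peer_cache))  # → economy
print(resolve_via_peers([0.5, 0.5, 0.5], peer_cache))     # → None (fall back to LLM)
```

Every query resolved from the cache is one LLM round-trip avoided, which is how this mechanism contributes to the reported reduction in LLM-bound traffic.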