🤖 AI Summary
In bandwidth-constrained settings, hybrid language models (HLMs) suffer from high token-level prediction uncertainty, leading to frequent large language model (LLM) offloading and excessive communication overhead. To address this, we propose a communication-efficient federated HLM framework. Our approach features: (1) federated learning–based collaborative optimization of token-level uncertainty thresholds to enable adaptive edge-cloud inference decisions; (2) a semantic similarity–driven peer-to-peer token reuse mechanism to minimize redundant transmissions; and (3) a hierarchical model aggregation strategy that balances convergence speed and privacy preservation. Evaluated on a large-scale news classification task, the framework reduces LLM-related communication volume by over 95%, with less than 0.3% accuracy degradation. It significantly improves communication efficiency, scalability, and practical deployability of edge AI systems.
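The core routing decision described above — accept the local SLM's token when its prediction is confident, offload to the LLM when uncertainty exceeds a learned threshold — can be sketched as follows. This is a minimal illustration, assuming entropy as the uncertainty measure; the function names and the threshold value are illustrative, not taken from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def route_token(probs, threshold):
    """Return 'slm' to accept the local prediction, 'llm' to offload.

    The per-client entropy threshold is the quantity the framework is
    said to optimize collaboratively via federated learning.
    """
    return "llm" if token_entropy(probs) > threshold else "slm"

# A confident distribution stays on-device; an ambiguous one is offloaded.
confident = [0.97, 0.01, 0.01, 0.01]
ambiguous = [0.30, 0.28, 0.22, 0.20]
print(route_token(confident, threshold=0.5))  # → slm
print(route_token(ambiguous, threshold=0.5))  # → llm
```

Learning this threshold per client (rather than fixing it globally) is what lets the framework adapt offloading frequency to each client's local data distribution.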
📝 Abstract
Hybrid Language Models (HLMs) combine the low-latency efficiency of Small Language Models (SLMs) on edge devices with the high accuracy of Large Language Models (LLMs) on centralized servers. Unlike traditional end-to-end LLM inference, HLMs reduce latency and communication by invoking LLMs only when local SLM predictions are uncertain, i.e., when token-level confidence is low or entropy is high. However, ambiguous or low-confidence predictions still require frequent offloading to the LLM, leading to significant communication overhead in bandwidth-constrained settings. To address this, we propose FedHLM, a communication-efficient HLM framework that integrates uncertainty-aware inference with Federated Learning (FL). FedHLM's key innovation lies in collaboratively learning token-level uncertainty thresholds that govern when LLM assistance is needed. Rather than using static or manually tuned thresholds, FedHLM employs FL to optimize these thresholds in a privacy-preserving, distributed manner. Additionally, it leverages embedding-based token representations for Peer-to-Peer (P2P) resolution, enabling clients to reuse tokens inferred by semantically similar peers without engaging the LLM. We further introduce hierarchical model aggregation: edge servers refine local routing policies through client updates, while cross-cluster coordination aligns global decision boundaries. This layered design captures recurring uncertainty patterns, reducing redundant LLM queries. Experiments on large-scale news classification tasks show that FedHLM reduces LLM transmissions by over 95% with negligible accuracy loss, making it well-suited for scalable and efficient edge AI applications.
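The P2P resolution step in the abstract — reusing a token already inferred by a semantically similar peer instead of querying the LLM — amounts to a nearest-neighbor lookup over peer token embeddings. The sketch below is an assumption-laden illustration: the cache structure, cosine-similarity matching, and the 0.9 similarity threshold are hypothetical choices, not details from the paper.

```python
def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def resolve_via_peers(query_emb, peer_cache, sim_threshold=0.9):
    """Return a peer's token if one is semantically close enough.

    peer_cache: list of (token, embedding) pairs shared by peers.
    Returns None when no peer match clears the threshold, i.e. the
    query must fall back to the LLM.
    """
    best_token, best_sim = None, sim_threshold
    for token, emb in peer_cache:
        sim = cosine_similarity(query_emb, emb)
        if sim >= best_sim:
            best_token, best_sim = token, sim
    return best_token

# Toy usage: a near-duplicate query is resolved from the peer cache.
peer_cache = [("economy", [0.9, 0.1, 0.0]), ("sports", [0.0, 0.2, 0.95])]
print(resolve_via_peers([0.88, 0.12, 0.05], peer_cache))  # → economy
print(resolve_via_peers([0.5, 0.5, 0.5], peer_cache))     # → None (fall back to LLM)
```

Every query resolved from the cache is one LLM round-trip avoided, which is how this mechanism contributes to the reported reduction in LLM-bound traffic.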