🤖 AI Summary
To address the high communication overhead and heterogeneous-architecture adaptation challenges in federated fine-tuning of large language models (LLMs) over wireless networks, this paper proposes a communication-aware knowledge distillation framework. The method integrates three key components: (1) communication-state-driven adaptive Top-k logits sparsification, (2) dynamic logits aggregation that eliminates zero-padding noise, and (3) LoRA-based hidden-layer projection for efficient knowledge transfer under bandwidth constraints. Unlike conventional parameter-sharing or standard federated distillation approaches, the framework significantly alleviates the burden of high-dimensional logits transmission while preserving model consistency and generalization across heterogeneous clients. Experimental results demonstrate that the proposed method reduces communication cost by approximately 50% while maintaining comparable model performance, achieving both practical deployability and scalability in resource-constrained wireless federated learning settings.
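The first component, communication-state-driven Top-k sparsification, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear mapping from the bandwidth fraction to k, and the names `adaptive_topk_sparsify`, `bw_max`, `k_min`, and `k_max`, are all assumptions for the sake of the example.

```python
def adaptive_topk_sparsify(logits, bandwidth, bw_max, k_min=2, k_max=8):
    """Keep only the k largest logits, with k scaled by current bandwidth.

    logits    : list of per-vocabulary-token logit values
    bandwidth : current available bandwidth (same units as bw_max)
    bw_max    : bandwidth at which the full k_max logits are sent

    Returns a sparse {vocab_index: logit} dict. The linear
    bandwidth-to-k schedule below is illustrative, not the
    paper's exact channel-state mapping.
    """
    frac = max(0.0, min(1.0, bandwidth / bw_max))
    k = max(k_min, round(k_min + frac * (k_max - k_min)))
    k = min(k, len(logits))
    # Select indices of the k largest logit values.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return {i: logits[i] for i in top}
```

Under poor channel conditions (`bandwidth` near 0) only `k_min` logits are transmitted; as conditions improve, k grows toward `k_max`, trading bandwidth for distillation fidelity.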
📝 Abstract
Federated learning (FL) for large language models (LLMs) offers a privacy-preserving scheme, enabling clients to collaboratively fine-tune locally deployed LLMs or smaller language models (SLMs) without exchanging raw data. While parameter-sharing methods in traditional FL solve a number of technical challenges, they still incur high communication overhead and struggle to adapt to heterogeneous model architectures. Federated distillation, a framework for mutual knowledge transfer via shared logits, typically offers lower communication overhead than parameter-sharing methods. However, transmitting logits from LLMs remains challenging for bandwidth-limited clients due to their high dimensionality. In this work, we focus on federated LLM distillation with efficient communication. To achieve this, we first propose an adaptive Top-k logit selection mechanism that dynamically sparsifies logits according to real-time communication conditions. Then, to tackle the dimensional inconsistency introduced by the adaptive sparsification, we design an adaptive logits aggregation scheme that effectively alleviates the artificial and uninformative inputs introduced by conventional zero-padding methods. Finally, to enhance the distillation effect, we incorporate a LoRA-adapted hidden-layer projection from the LLM into the distillation loss, further reducing communication overhead while providing richer representations. Experimental results demonstrate that our scheme achieves superior performance compared to baseline methods while effectively reducing communication overhead by approximately 50%.
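The aggregation scheme above addresses the fact that different clients, operating under different channel conditions, transmit sparse logit sets of different sizes. A minimal sketch of zero-padding-free aggregation is shown below: each vocabulary index is averaged only over the clients that actually transmitted it, rather than padding every client's vector to the full vocabulary with zeros. The function name `aggregate_sparse_logits` and the unweighted-mean rule are illustrative assumptions, not the paper's exact scheme.

```python
from collections import defaultdict

def aggregate_sparse_logits(client_logits):
    """Aggregate variable-size sparse logit sets from heterogeneous clients.

    client_logits : list of {vocab_index: logit} dicts, one per client,
                    each possibly covering a different subset of the
                    vocabulary (e.g. different Top-k sizes).

    Each index is averaged only over the clients that sent it, so
    absent entries never contribute artificial zeros to the mean.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for sparse in client_logits:
        for idx, val in sparse.items():
            sums[idx] += val
            counts[idx] += 1
    return {idx: sums[idx] / counts[idx] for idx in sums}
```

With conventional zero-padding, an index sent by only one of N clients would have its logit diluted by N-1 artificial zeros; here it retains its transmitted value.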