🤖 AI Summary
In heterogeneous wireless networks, federated LoRA fine-tuning of large language models suffers from excessive wall-clock time due to coupled system and data heterogeneity. Method: We establish, for the first time without the restrictive bounded gradient assumption, a convergence bound for federated LoRA with arbitrary and independent client sampling, and propose an adaptive bandwidth allocation framework that jointly optimizes the LoRA sketching ratios and client sampling probabilities to enable coordinated communication-computation resource scheduling. Our approach combines federated learning, LoRA-based low-rank adaptation, and non-convex optimization theory, supporting independent client sampling and resource-aware scheduling. Contribution/Results: Experiments across multiple models and datasets demonstrate significant end-to-end training acceleration, up to 2.3× faster than state-of-the-art methods, while strictly respecting communication bandwidth and local computation constraints.
📝 Abstract
Federated LoRA has emerged as a promising technique for efficiently fine-tuning large language models (LLMs) on distributed devices by reducing the number of trainable parameters. However, existing approaches often overlook the theoretical and practical implications of system and data heterogeneity, and thus fail to optimize overall training efficiency, particularly in terms of wall-clock time. In this paper, we propose an adaptive federated LoRA strategy with independent client sampling to minimize the wall-clock convergence time of federated fine-tuning under both computation and communication heterogeneity. We first derive a new convergence bound for federated LoRA with arbitrary and independent client sampling, notably without requiring the stringent bounded gradient assumption. We then introduce an adaptive bandwidth allocation scheme that accounts for heterogeneous client resources and system-wide bandwidth constraints. Based on the derived theory, we formulate a non-convex optimization problem that jointly determines the LoRA sketching ratios and sampling probabilities to minimize wall-clock convergence time, and develop an efficient, low-complexity algorithm to approximate its solution. Finally, extensive experiments demonstrate that our approach significantly reduces wall-clock training time compared to state-of-the-art methods across various models and datasets.
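To make the "independent client sampling" idea concrete, here is a minimal sketch of the standard mechanism such schemes build on: each client participates independently with its own probability, and the server rescales each received update by the inverse of that probability so the aggregate remains unbiased. The function names and the list-based update representation are illustrative, not the paper's implementation.

```python
import random

def sample_clients(probs, rng=random):
    """Include each client i independently with probability probs[i]."""
    return [i for i, p in enumerate(probs) if rng.random() < p]

def aggregate(updates, probs, sampled):
    """Inverse-probability-weighted sum of sampled client updates.

    Because E[ sum_{i in S} u_i / p_i ] = sum_i u_i, the aggregate is an
    unbiased estimate of the full-participation update, for any choice of
    (nonzero) sampling probabilities.
    """
    dim = len(next(iter(updates.values())))
    agg = [0.0] * dim
    for i in sampled:
        for j, v in enumerate(updates[i]):
            agg[j] += v / probs[i]
    return agg
```

Lower sampling probabilities reduce per-round communication but inflate the variance of the estimate, which is exactly the trade-off a joint optimization over sampling probabilities and per-client resources has to balance.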