🤖 AI Summary
To address the rapidly escalating communication and memory overheads of full-parameter fine-tuning of large language models (LLMs) under federated learning, this paper proposes a 1-bit gradient compression framework. It introduces a novel "seed-sign pair" representation for gradients, integrating zeroth-order optimization, cross-device shared pseudorandom number generators (PRNGs), and sign-based quantization, so that clients upload and download only 1 bit per aggregation step. The paper theoretically establishes exponential convergence ($\mathcal{O}(e^{-t})$) and proves inherent robustness to data heterogeneity and Byzantine failures. Extensive experiments across model sizes from 11M to 13B parameters demonstrate performance on par with first- and zeroth-order baselines, while reducing communication costs by 2–4 orders of magnitude and compressing memory overhead to inference-level requirements. To the authors' knowledge, this is the first framework enabling efficient, robust, and resource-light full-parameter federated fine-tuning in privacy-sensitive settings.
📝 Abstract
Federated fine-tuning (FFT) attempts to fine-tune a pre-trained model with private data from distributed clients by exchanging models rather than data under the orchestration of a parameter server (PS). To overcome the bottleneck created by the communication and memory overhead that grows with model size in such systems, we propose *FeedSign*, an FFT algorithm in which the upload and download payload for an aggregation step is exactly $1$ bit, while the memory overhead is squeezed to the amount needed for inference. This is realized by utilizing zeroth-order (ZO) optimizers on large models and pseudo-random number generators (PRNGs) shared across devices to represent the gradient estimates as seed-sign pairs. We conduct theoretical analysis of *FeedSign* and show that it converges at an exponential rate $\mathcal{O}(e^{-t})$, where $t$ is the number of elapsed steps, under widely used assumptions. Moreover, *FeedSign* is found to be robust against data heterogeneity and Byzantine attacks. We conduct extensive experiments on models of different structures and sizes (11M to 13B parameters) and find that the proposed method performs comparably to, and in some scenarios better than, its ZO and first-order (FO) counterparts, albeit with orders-of-magnitude lower communication overhead. We also discuss some interesting advantages as byproducts guaranteed by the minimalistic design of *FeedSign*.
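The seed-sign pair idea can be sketched in a few lines: because every party holds the same PRNG, a round's random perturbation direction can be regenerated from a shared seed, so a client only needs to transmit the sign of a zeroth-order directional-derivative estimate. The toy sketch below (function names, the scalar quadratic "loss", and all hyperparameters are illustrative assumptions, not the paper's implementation) shows the mechanism with a two-point ZO estimate:

```python
import numpy as np

def zo_sign_step(theta, loss_fn, seed, eps=1e-3):
    """Client side: estimate the directional derivative along a shared
    random direction z and keep only its sign -- the 1-bit upload."""
    z = np.random.default_rng(seed).standard_normal(theta.shape)
    # Two forward passes only (inference-level memory): central difference.
    proj_grad = (loss_fn(theta + eps * z) - loss_fn(theta - eps * z)) / (2 * eps)
    return 1 if proj_grad >= 0 else -1  # everything the client sends up

def apply_sign_step(theta, seed, sign_bit, lr=1e-2):
    """Server / any client: regenerate z from the shared seed and apply
    the signed update, so no dense gradient is ever transmitted."""
    z = np.random.default_rng(seed).standard_normal(theta.shape)
    return theta - lr * sign_bit * z

# Usage: minimize f(x) = ||x||^2 with 1 bit of uplink per step.
loss = lambda x: float(np.sum(x ** 2))
theta = np.ones(4)
for step in range(200):
    bit = zo_sign_step(theta, loss, seed=step)       # client
    theta = apply_sign_step(theta, seed=step, sign_bit=bit)  # server broadcast
```

Using the round index as the seed is one simple way to keep the PRNG synchronized without extra communication; in practice any agreed-upon seed schedule shared by the PS and clients would serve the same purpose.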