Warming Up for Zeroth-Order Federated Pre-Training with Low Resource Clients

📅 2025-09-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
In federated learning, low-resource edge devices are often excluded from training due to memory and communication constraints, which exacerbates data bias and hinders model generalization. To address this, the paper proposes ZOWarmUp, the first zeroth-order optimization algorithm designed for full-cycle federated pre-training, i.e., training from a random initialization rather than fine-tuning. Built on MeZO's zeroth-order gradient estimation, ZOWarmUp replaces gradient uploads with shared random seeds, drastically reducing communication and memory overhead. It further incorporates variance-reduction mechanisms and a client capability-aware stratification strategy to ensure convergence stability and equitable participation. Experiments across multiple datasets and model architectures show accuracy gains of +2.1–4.7% and improved data utilization under high proportions of resource-constrained clients. Notably, this marks the first successful deployment of a zeroth-order optimizer in federated pre-training, simultaneously advancing efficiency, fairness, and generalization.

📝 Abstract
Federated learning enables collaborative model training across numerous edge devices without requiring participants to share data; however, memory and communication constraints on these edge devices may preclude their participation in training. We consider a setting in which a subset of edge devices fall below a critical memory or communication threshold required to conduct model updates. Under typical federated optimization algorithms, these devices are excluded from training, which renders their data inaccessible and increases system-induced bias. We are inspired by MeZO, a zeroth-order method used for memory-efficient fine-tuning. The increased variance inherent to zeroth-order gradient approximations has relegated previous zeroth-order optimizers exclusively to the domain of fine-tuning, a limitation we seek to correct. We devise a federated, memory-efficient zeroth-order optimizer, ZOWarmUp, that permits zeroth-order training from a random initialization. ZOWarmUp leverages differing client capabilities and careful variance reduction techniques to facilitate participation of under-represented, low-resource clients in model training. Like other federated zeroth-order methods, ZOWarmUp eliminates the need for edge devices to transmit their full gradients to the server and instead relies on only a small set of random seeds, rendering the uplink communication cost negligible. We present experiments using various datasets and model architectures to show that ZOWarmUp is a robust algorithm that can be applied under a wide variety of circumstances. For systems with a high proportion of edge devices that would otherwise be excluded from training, this algorithm provides access to a greater volume and diversity of data, thus improving training outcomes.
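The memory-efficient zeroth-order idea the abstract builds on can be illustrated with a minimal sketch of a MeZO-style step: a random perturbation direction is drawn from a seed, the directional derivative is estimated from two loss evaluations, and only the seed plus one scalar would need to leave the device. This is a simplified illustration, not the paper's actual ZOWarmUp algorithm; the function name and hyperparameters are hypothetical.

```python
import random

def zo_gradient_step(params, loss_fn, seed, eps=1e-3, lr=1e-4):
    """One MeZO-style zeroth-order step (illustrative sketch).

    Draws a seeded Gaussian direction z, estimates the directional
    derivative via two forward passes, and updates params along z.
    A client would only need to communicate `seed` and the scalar
    `proj_grad`, not a full gradient vector.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in params]        # shared random direction
    plus = loss_fn([p + eps * zi for p, zi in zip(params, z)])
    minus = loss_fn([p - eps * zi for p, zi in zip(params, z)])
    proj_grad = (plus - minus) / (2 * eps)           # scalar projected gradient
    new_params = [p - lr * proj_grad * zi for p, zi in zip(params, z)]
    return new_params, proj_grad
```

In a truly memory-efficient implementation, z is not stored but regenerated from the seed whenever it is needed, which is what keeps the memory footprint near that of inference.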
Problem

Research questions and friction points this paper is trying to address.

Enables low-resource clients to participate in federated training
Reduces memory and communication constraints in edge devices
Allows zeroth-order optimization from random initialization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zeroth-order optimizer for memory-efficient federated pre-training
Leverages low-resource clients with variance reduction techniques
Uses random seeds instead of gradients to reduce communication cost
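The seed-based communication point above can be sketched from the server's side: each client uploads only a seed and a scalar projected gradient, and the server regenerates every perturbation direction locally before averaging. This is a hypothetical sketch under a simple averaging assumption; the paper's actual aggregation and variance-reduction rules may differ.

```python
import random

def server_aggregate(params, client_msgs, lr=1e-4):
    """Reconstruct and average client updates from (seed, proj_grad) pairs.

    Each client's uplink is one integer seed and one float, so the
    communication cost is independent of model size. Illustrative only.
    """
    dim = len(params)
    update = [0.0] * dim
    for seed, proj_grad in client_msgs:
        rng = random.Random(seed)
        for i in range(dim):
            # Regenerate the client's perturbation direction from its seed.
            update[i] += proj_grad * rng.gauss(0.0, 1.0)
    n = len(client_msgs)
    return [p - lr * u / n for p, u in zip(params, update)]
```

Because the direction is pseudorandom, the server and client reconstruct bit-identical perturbations from the same seed, which is what makes uploading the full gradient unnecessary.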