FedMuon: Accelerating Federated Learning with Matrix Orthogonalization

πŸ“… 2025-10-31
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Federated learning (FL) suffers from excessive communication rounds and severe client drift under non-IID data. Conventional local optimizers (e.g., SGD, Adam) neglect the geometric structure of weight matrices, amplifying ill-conditioned directions, worsening condition numbers, and impeding convergence. To address this, we propose FedMuonβ€”the first matrix-orthogonalization optimizer tailored for FL. FedMuon explicitly models the geometric structure of weights via local matrix orthogonalization and integrates momentum aggregation with local-global gradient alignment to effectively mitigate client drift under non-IID settings. We theoretically establish linear-speedup convergence without requiring data homogeneity assumptions. Extensive experiments demonstrate that FedMuon significantly reduces communication rounds (averaging 37% fewer) and improves test accuracy (+1.2–2.8%) across language and vision models, outperforming baselines including SGD and AdamW.

πŸ“ Abstract
The core bottleneck of Federated Learning (FL) lies in communication rounds; that is, achieving more effective local updates is crucial for reducing the number of rounds. Existing FL methods still primarily use element-wise local optimizers (Adam/SGD), neglecting the geometric structure of the weight matrices. This often amplifies pathological directions in the weights during local updates, leading to deterioration of the condition number and slow convergence. Therefore, we introduce the Muon optimizer locally, which uses matrix orthogonalization to optimize matrix-structured parameters. Experimental results show that, in the IID setting, Local Muon significantly accelerates the convergence of FL and reduces communication rounds compared to Local SGD and Local AdamW. However, in the non-IID setting, independent matrix orthogonalization based on the local distribution of each client induces strong client drift. Applying Muon in non-IID FL poses two significant challenges: (1) client-specific preconditioners leading to client drift; (2) momentum reinitialization. To address these challenges, we propose a novel Federated Muon optimizer (FedMuon), which incorporates two key techniques: (1) momentum aggregation, where clients use the aggregated momentum for local initialization; (2) local-global alignment, where the local gradients are aligned with the global update direction to significantly reduce client drift. Theoretically, we prove that FedMuon achieves a linear-speedup convergence rate of $\mathcal{O}(1/\sqrt{SKR})$ without any heterogeneity assumption, where $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds. Empirically, we validate the effectiveness of FedMuon on language and vision models. Compared to several baselines, FedMuon significantly reduces communication rounds and improves test accuracy.
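Muon's core operation, which the abstract builds on, replaces the raw momentum matrix with an approximately orthogonalized version before each local update. A minimal NumPy sketch of the quintic Newton-Schulz iteration commonly used in public Muon implementations for this step (coefficients and step count are from those implementations, not this paper's code):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a matrix G (drive its singular values
    toward 1) with the quintic Newton-Schulz iteration used by Muon.
    Illustrative sketch only, not the paper's reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    # Normalize so all singular values are <= 1 (Frobenius >= spectral norm).
    X = G / (np.linalg.norm(G) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:                 # keep A = X X^T as the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

Unlike an exact SVD-based polar factor, this iteration uses only matrix multiplications, which is why it is attractive as a per-step optimizer primitive on accelerators.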
Problem

Research questions and friction points this paper is trying to address.

Reducing communication rounds in Federated Learning through matrix orthogonalization
Addressing client drift in non-IID settings via momentum aggregation
Improving convergence speed and accuracy for distributed model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Matrix orthogonalization optimizes parameters for faster convergence
Momentum aggregation initializes clients with aggregated global momentum
Local-global alignment reduces client drift in non-IID settings
πŸ”Ž Similar Papers
No similar papers found.
Junkang Liu
Tianjin University
Fanhua Shang
Professor at Tianjin University
Machine Learning · Data Mining · Computer Vision
Junchao Zhou
Tianjin University
Hongying Liu
Tianjin University
Machine Learning · Image Processing
Yuanyuan Liu
Xidian University
Jin Liu
Xidian University