🤖 AI Summary
This work addresses the vulnerability of standard federated learning algorithms to Byzantine nodes, which can cause catastrophic failure. It presents the first unified framework modeling Byzantine-robust distributed optimization as an inexact gradient method subject to both additive and multiplicative errors. Within this framework, two novel algorithms are proposed: one leveraging Nesterov acceleration and the other integrating an optimization similarity assumption with a robust aggregation mechanism. Theoretical analysis establishes that the proposed methods achieve optimal asymptotic error bounds under Byzantine attacks. Experimental results demonstrate that these algorithms significantly reduce the number of communication rounds required for convergence and outperform existing approaches in both robustness and efficiency.
📝 Abstract
Standard federated learning algorithms are vulnerable to adversarial nodes, a.k.a. Byzantine failures. To address this issue, robust distributed learning algorithms have been developed, which typically replace parameter averaging with robust aggregation rules. While generic conditions on these aggregations exist that guarantee the convergence of (Stochastic) Gradient Descent (SGD), the analyses remain rather ad hoc. This hinders the development of more complex robust algorithms, such as accelerated ones. In this work, we show that Byzantine-robust distributed optimization can, under standard generic assumptions, be cast as an instance of optimization with inexact gradient oracles (with both additive and multiplicative error terms), an active field of research. This allows us, for instance, to show directly that GD on top of standard robust aggregation procedures achieves optimal asymptotic error in the Byzantine setting. Going further, we propose two optimization schemes to speed up convergence. The first is a Nesterov-type accelerated scheme whose proof derives directly from accelerated inexact gradient results applied to our formulation. The second hinges on Optimization under Similarity, in which the server leverages an auxiliary loss function that approximates the global loss. Both approaches drastically reduce the communication complexity compared to previous methods, as we show both theoretically and empirically.
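To make the core idea concrete, here is a minimal, hypothetical simulation (not the paper's implementation) of the baseline setting the abstract describes: the server runs plain GD, but instead of averaging worker gradients it applies a robust aggregator, here a coordinate-wise median, so that a minority of Byzantine workers sending arbitrary vectors cannot derail convergence. The loss, worker counts, and step size are illustrative assumptions.

```python
import random
from statistics import median

def coordinate_median(grads):
    """Coordinate-wise median: a standard robust aggregation rule.

    With a majority of honest workers, each output coordinate lies
    within the range of the honest values, bounding the aggregation
    error -- the 'inexact gradient oracle' view of the abstract.
    """
    dim = len(grads[0])
    return [median(g[j] for g in grads) for j in range(dim)]

def robust_gd(target, n_honest=7, n_byz=3, steps=60, lr=0.5, seed=0):
    """GD with robust aggregation on a toy quadratic loss (illustrative).

    Honest workers report the exact gradient of f(x) = ||x - target||^2 / 2;
    Byzantine workers report arbitrary adversarial vectors.
    """
    rng = random.Random(seed)
    dim = len(target)
    x = [0.0] * dim
    for _ in range(steps):
        honest = [[xi - ti for xi, ti in zip(x, target)]
                  for _ in range(n_honest)]
        byz = [[rng.uniform(-100.0, 100.0) for _ in range(dim)]
               for _ in range(n_byz)]
        g = coordinate_median(honest + byz)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

x = robust_gd(target=[1.0, -2.0, 3.0])
# With 7 identical honest gradients out of 10, the median always
# equals the honest gradient here, so GD converges to the target.
```

Replacing `coordinate_median` with a plain mean makes the same run diverge under the Byzantine inputs, which is the vulnerability the abstract starts from.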