🤖 AI Summary
This work addresses the server bandwidth bottleneck in hierarchical distributed learning by introducing a gradient coding architecture that incorporates relay nodes. In the presence of stragglers and Byzantine (adversarial) nodes among both workers and relays, it establishes, for the first time, the information-theoretically optimal trade-off between communication and computation in hierarchical systems. The proposed linear coding scheme jointly models computation delays and failures across both the worker-to-relay and relay-to-server links, making the communication loads on both links simultaneously optimal with respect to the computation load. The scheme guarantees exact recovery of the global gradient while minimizing end-to-end communication overhead, alleviating the bandwidth limitation inherent in conventional single-layer gradient coding and yielding a bandwidth-efficient, fault-tolerant solution for distributed learning.
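As background for the single-layer baseline the summary refers to, here is a minimal runnable sketch of classic fractional-repetition gradient coding. This is not the paper's hierarchical scheme, and all names and parameters below are illustrative assumptions: each data partition is replicated on $s+1$ workers, so the server recovers the exact gradient sum from any $n-s$ responses.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 6   # workers, one data partition each
s = 2   # stragglers tolerated; the scheme needs (s + 1) | n
d = 4   # gradient dimension

# g[i] is the partial gradient of partition i; the server wants their sum.
g = rng.normal(size=(n, d))
full_gradient = g.sum(axis=0)

# Fractional repetition: workers split into n // (s + 1) groups of size
# s + 1; every worker in group j computes the same s + 1 partitions and
# transmits their sum, so the computation load per worker is s + 1.
groups = [list(range(j, j + s + 1)) for j in range(0, n, s + 1)]
encoded = {}
for gi, members in enumerate(groups):
    block_sum = g[gi * (s + 1):(gi + 1) * (s + 1)].sum(axis=0)
    for wid in members:
        encoded[wid] = block_sum

# Any s workers may straggle; the server decodes from the survivors.
stragglers = set(rng.choice(n, size=s, replace=False))
received = {wid: msg for wid, msg in encoded.items() if wid not in stragglers}

# Each group has s + 1 members and at most s workers straggle overall,
# so every group has at least one survivor; sum one message per group.
decoded = np.zeros(d)
for members in groups:
    survivor = next(wid for wid in members if wid in received)
    decoded += received[survivor]

assert np.allclose(decoded, full_gradient)
print(f"exact gradient recovered from {len(received)} of {n} workers")
```

The replication factor $s+1$ is the computation load traded against straggler tolerance; the paper's contribution is the analogous optimal trade-off when a relay layer sits between the workers and the server.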
📝 Abstract
In this paper, we study gradient coding in a hierarchical setting, where intermediate nodes are placed between the server and the workers. This structure reduces the bandwidth requirement at the server, which is the bottleneck in conventional gradient coding systems. The intermediate nodes, referred to as $\textit{relays}$, process the data received from the workers and send the results to the server for the final gradient computation. Our main contribution is deriving the optimal communication-computation trade-off by designing a linear coding scheme inspired by coded computing techniques, accounting for straggling and adversarial nodes among both relays and workers. Processing the data at the relays makes it possible for the relay-to-server and worker-to-relay communication loads to be simultaneously optimal with respect to the computation load.
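To make the relay idea concrete, the following sketch simulates a two-layer system under plain repetition at both layers, a simple stand-in for the paper's optimal linear code that covers stragglers only, not adversarial nodes; all parameter names are assumptions for illustration. Each relay aggregates its cluster's worker messages into a single vector before forwarding, so the server's inbound traffic scales with the number of relays rather than the number of workers.

```python
import numpy as np

rng = np.random.default_rng(1)

d = 4          # gradient dimension
num_teams = 2  # disjoint data blocks; one relay team per block
s_r = 1        # straggling relays tolerated per team (team size s_r + 1)
s_w = 1        # straggling workers tolerated inside each relay's cluster
w = 4          # workers per relay cluster; the scheme needs (s_w + 1) | w

# blocks[t][p] is the partial gradient of sub-partition p of block t.
blocks = rng.normal(size=(num_teams, w, d))
full_gradient = blocks.sum(axis=(0, 1))

def relay_round(block):
    """One relay cluster: workers run a fractional-repetition code on the
    block's w sub-partitions; the relay decodes the block sum from any
    w - s_w survivors and forwards a single length-d message upstream."""
    starts = range(0, w, s_w + 1)
    groups = [list(range(j, j + s_w + 1)) for j in starts]
    group_msgs = [block[j:j + s_w + 1].sum(axis=0) for j in starts]
    stragglers = set(rng.choice(w, size=s_w, replace=False))
    total = np.zeros(d)
    for gi, members in enumerate(groups):
        # Every member of a group sends the same message, and at most
        # s_w workers straggle, so each group keeps at least one survivor.
        assert any(m not in stragglers for m in members)
        total += group_msgs[gi]
    return total

# Server side: each block is replicated on a team of (s_r + 1) relays,
# so after any s_r relay stragglers per team, one message per team
# remains and the sum of those messages is the exact global gradient.
decoded = np.zeros(d)
for t in range(num_teams):
    team = [relay_round(blocks[t]) for _ in range(s_r + 1)]
    straggling = set(rng.choice(s_r + 1, size=s_r, replace=False))
    survivor = next(i for i in range(s_r + 1) if i not in straggling)
    decoded += team[survivor]

assert np.allclose(decoded, full_gradient)
print("hierarchical exact recovery with straggling relays and workers")
```

Because every relay forwards a single length-$d$ message regardless of its cluster size, it is the processing at the relays that shrinks the relay-to-server load; the paper's linear scheme achieves the optimal loads on both links simultaneously, which plain repetition does not.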