🤖 AI Summary
This study addresses the communication bottleneck in cross-silo federated learning, which stems from large-scale models and geographically distributed deployments and is compounded by a lack of systematic guidance on backend selection. The work benchmarks the mainstream communication backends MPI, gRPC, and PyTorch RPC across a broad range of model sizes and network conditions, measuring both point-to-point and end-to-end performance. Building on these results, the authors propose gRPC+S3, a hybrid backend that combines gRPC with Amazon S3. Empirical results show up to a 3.8× end-to-end speedup over gRPC under realistic network conditions, and the findings offer practical guidance for selecting and configuring communication backends in cross-silo federated learning.
📝 Abstract
Federated learning (FL) has emerged as a practical means for privacy-preserving distributed machine learning. FL's versatile design makes it suitable for various training settings, from IoT edge devices in cross-device FL to powerful servers in cross-silo FL. A key consequence of this versatility is the high diversity in the networking configurations of FL applications. Coupled with the rising demand for large-scale models such as large language models, well-informed selection and configuration of communication backends becomes crucial for ensuring optimal performance in FL systems. This work focuses on cross-silo federated learning, presenting in-depth benchmarks of various communication backends, including MPI, gRPC, and PyTorch RPC. In addition, we introduce gRPC+S3, a hybrid backend designed to overcome the limitations of existing approaches, particularly for transmitting large models across geo-distributed deployments; it achieves up to $3.8\times$ end-to-end speedup over gRPC. Our benchmarks examine point-to-point and end-to-end performance for a broad range of model sizes running under realistic network conditions. Our findings provide practical insights for selecting and configuring communication backends suited to specific federated learning tasks and network configurations.
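The abstract does not describe how gRPC+S3 works internally. As a rough illustration of one way a hybrid backend of this kind could be organized, the sketch below sends small payloads inline in a control message and routes large payloads through an object store, passing only the object key in the message. Everything here is an assumption made for illustration: `Message`, `InMemoryObjectStore`, `pack`, `unpack`, and the size cutoff are hypothetical names and values, the store is an in-memory stand-in rather than Amazon S3, and no real gRPC or S3 calls are made.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical cutoff; the source does not state a threshold.
INLINE_LIMIT_BYTES = 4 * 1024 * 1024


@dataclass
class Message:
    """Small control message, standing in for what would travel over an RPC channel."""
    sender: str
    inline_payload: Optional[bytes] = None  # set when the payload is small
    object_key: Optional[str] = None        # set when the payload was offloaded


class InMemoryObjectStore:
    """In-memory stand-in for an object store (assumption: not a real S3 client)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def pack(sender: str, payload: bytes, store: InMemoryObjectStore,
         limit: int = INLINE_LIMIT_BYTES) -> Message:
    """Send small payloads inline; offload large ones and send only their key."""
    if len(payload) <= limit:
        return Message(sender=sender, inline_payload=payload)
    key = f"{sender}/{hashlib.sha256(payload).hexdigest()}"
    store.put(key, payload)
    return Message(sender=sender, object_key=key)


def unpack(msg: Message, store: InMemoryObjectStore) -> bytes:
    """Recover the payload, fetching it from the store if it was offloaded."""
    if msg.object_key is not None:
        return store.get(msg.object_key)
    assert msg.inline_payload is not None
    return msg.inline_payload


# Usage with a tiny limit so both paths are exercised.
store = InMemoryObjectStore()
small = pack("silo-a", b"x" * 10, store, limit=16)
large = pack("silo-a", b"y" * 64, store, limit=16)
```

In this sketch the control message stays small regardless of model size, which is the property that would matter for large models on wide-area links; whether the authors' backend uses a size cutoff or offloads every model update is not stated in the source.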