🤖 AI Summary
Existing high-performance computing (HPC) and large-scale AI training systems lack a fine-grained, reproducible, and scalable performance evaluation framework for collective communication operations. This paper introduces CollectiveBench—the first lightweight benchmarking framework that simultaneously achieves high-precision profiling, strict reproducibility, and cross-platform scalability. It employs a modular architecture and low-overhead performance instrumentation to monitor multi-dimensional metrics—including latency, bandwidth, and topology-aware throughput—and enables automated testing workflows. The framework is extensible to new collective operators and heterogeneous hardware (e.g., GPUs, NPUs, interconnects). Evaluated on mainstream HPC and AI clusters, CollectiveBench improves analysis efficiency significantly, with measurement error under 3% and near-linear scalability up to 1,000 accelerators. It establishes a systematic, infrastructure-level foundation for collective communication optimization.
📝 Abstract
Collective operations are cornerstones of both HPC application and large-scale AI training and inference. Yet, comprehensive, systematic and reproducible performance evaluation and benchmarking of said operations is not straightforward. Existing frameworks do not provide sufficiently detailed profiling information, nor they ensure reproducibility and extensibility. In this paper, we present PICO (Performance Insights for Collective Operations), a novel lightweight, extensible framework built with the aim of simplifying collective operations benchmarking.