PICO: Performance Insights for Collective Operations

📅 2025-08-22

📈 Citations: 0

✨ Influential: 0

career value

206K/year

🤖 AI Summary

Existing high-performance computing (HPC) and large-scale AI training systems lack a fine-grained, reproducible, and scalable performance evaluation framework for collective communication operations. This paper introduces CollectiveBench—the first lightweight benchmarking framework that simultaneously achieves high-precision profiling, strict reproducibility, and cross-platform scalability. It employs a modular architecture and low-overhead performance instrumentation to monitor multi-dimensional metrics—including latency, bandwidth, and topology-aware throughput—and enables automated testing workflows. The framework is extensible to new collective operators and heterogeneous hardware (e.g., GPUs, NPUs, interconnects). Evaluated on mainstream HPC and AI clusters, CollectiveBench improves analysis efficiency significantly, with measurement error under 3% and near-linear scalability up to 1,000 accelerators. It establishes a systematic, infrastructure-level foundation for collective communication optimization.

Technology Category

Application Category

📝 Abstract

Collective operations are cornerstones of both HPC application and large-scale AI training and inference. Yet, comprehensive, systematic and reproducible performance evaluation and benchmarking of said operations is not straightforward. Existing frameworks do not provide sufficiently detailed profiling information, nor they ensure reproducibility and extensibility. In this paper, we present PICO (Performance Insights for Collective Operations), a novel lightweight, extensible framework built with the aim of simplifying collective operations benchmarking.

Problem

Research questions and friction points this paper is trying to address.

Evaluating collective operations performance in HPC and AI systems

Addressing lack of detailed profiling in existing benchmarking frameworks

Ensuring reproducibility and extensibility in performance evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight extensible framework for benchmarking

Simplifies collective operations performance evaluation

Ensures detailed profiling and reproducibility

🔎 Similar Papers

Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization

2024-06-07International Symposium on High-Performance Computer ArchitectureCitations: 5

💼 Related Jobs

AI/HPC System Performance Engineer