🤖 AI Summary
Current evaluations of LLM infrastructure rely predominantly on end-to-end metrics, which do not explain why one hardware and software configuration outperforms another. This work presents CCL-Bench, a reproducible, trace-based benchmark in which each contributed data point packages a fine-grained execution trace, a YAML workload card, and containerized launch scripts. A community-extensible toolkit computes compute, memory, and communication efficiency metrics from this evidence. Using this approach, the study examines how parallelization strategies, interconnect bandwidth, and the choice of training framework affect training step time. Key findings: higher compute-communication overlap can coincide with longer step time and reveal inefficient parallelization choices; on small and medium workloads, doubling TPU interconnect bandwidth improves end-to-end step time much more than doubling GPU interconnect bandwidth; and on identical hardware, the best-tuned configuration on one framework can run up to 3× slower than the best-tuned configuration on a peer framework.
📝 Abstract
Evaluative claims about LLM infrastructure -- ``workload X is fastest on hardware Y with software Z'' -- depend on a complex configuration space spanning hardware accelerators, interconnect bandwidth, software frameworks, parallelism plans, and communication libraries. Current infrastructure evaluation benchmarks publish a small set of end-to-end numbers that do not explain why one configuration outperforms another. We present CCL-Bench, a trace-based benchmark that addresses the limitations of existing benchmarks by recording reusable evidence for every ML workload. Each contributed data point in CCL-Bench packages an execution trace, a YAML workload card, and the launch scripts. We have developed a community-extensible toolkit to compute fine-grained compute, memory, and communication efficiency metrics from this evidence. Using CCL-Bench, we surface three claims that summary-statistic benchmarks cannot support: (i) higher compute-communication overlap can coincide with longer training step time and reveal inefficient parallelization choices, (ii) doubling TPU interconnect bandwidth yields a much higher end-to-end improvement in step time than doubling GPU interconnect bandwidth on small and medium workloads, and (iii) the best-tuned configuration on one training framework can run up to 3$\times$ slower than the best-tuned configuration on a peer framework on identical hardware.
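Claim (i) can be illustrated with a minimal sketch. It assumes a trace has already been reduced to lists of `(start, end)` intervals in milliseconds for compute kernels and communication operations. The function names, the definition of overlap as the fraction of communication time hidden behind compute, and the two toy plans are illustrative assumptions. They are not CCL-Bench's actual metric definitions or measured data.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_ms, end_ms); hypothetical trace reduction


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total(intervals: List[Interval]) -> float:
    """Total time covered by a list of disjoint intervals."""
    return sum(end - start for start, end in intervals)


def intersection(a: List[Interval], b: List[Interval]) -> float:
    """Time during which both interval sets are active (two-pointer sweep)."""
    a, b = merge(a), merge(b)
    i = j = 0
    shared = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            shared += hi - lo
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def overlap_ratio(compute: List[Interval], comm: List[Interval]) -> float:
    """Fraction of communication time hidden behind compute (assumed definition)."""
    comm_time = total(merge(comm))
    return intersection(compute, comm) / comm_time if comm_time else 0.0


def step_time(compute: List[Interval], comm: List[Interval]) -> float:
    """Wall-clock span from the first event start to the last event end."""
    events = compute + comm
    return max(end for _, end in events) - min(start for start, _ in events)


# Toy plans (made-up numbers): plan B hides all of its communication,
# but it communicates four times as much and its step takes longer.
plan_a = {"compute": [(0.0, 10.0)], "comm": [(8.0, 12.0)]}
plan_b = {"compute": [(0.0, 20.0)], "comm": [(2.0, 18.0)]}

for name, plan in (("A", plan_a), ("B", plan_b)):
    ratio = overlap_ratio(plan["compute"], plan["comm"])
    step = step_time(plan["compute"], plan["comm"])
    print(f"plan {name}: overlap={ratio:.2f}, step_time={step:.1f} ms")
```

In this toy setup, plan B reaches full overlap while plan A reaches only half, yet plan B's step is longer. An overlap figure reported on its own would rank the two plans the wrong way, which is the kind of case that trace-level evidence is meant to expose.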