CCL-Bench 1.0: A Trace-Based Benchmark for LLM Infrastructure

📅 2026-05-07

🤖 AI Summary
Current evaluations of LLM infrastructure rely predominantly on end-to-end metrics, which obscure why performance varies across hardware and software configurations. This work presents CCL-Bench, a reproducible, trace-based benchmarking framework in which each contributed data point packages a fine-grained execution trace, a YAML workload card, and containerized launch scripts, together with a community-extensible toolkit that computes compute, memory, and communication efficiency metrics from this evidence. Using this approach, the study examines how parallelization strategies, interconnect bandwidth, and framework choice affect training performance. Key findings include: higher compute-communication overlap can coincide with longer step time rather than shorter; doubling TPU interconnect bandwidth yields a much larger step-time improvement than doubling GPU interconnect bandwidth on small and medium workloads; and the best-tuned configuration on one framework can run up to 3× slower than the best-tuned configuration on a peer framework on identical hardware.
📝 Abstract
Evaluative claims about LLM infrastructure, such as "workload X is fastest on hardware Y with software Z", depend on a complex configuration space spanning hardware accelerators, interconnect bandwidth, software frameworks, parallelism plans, and communication libraries. Current infrastructure evaluation benchmarks publish a small set of end-to-end numbers that do not explain why one configuration outperforms another. We present CCL-Bench, a trace-based benchmark that addresses the limitations of existing benchmarks by recording reusable evidence for every ML workload. Each contributed data point in CCL-Bench packages an execution trace, a YAML workload card, and the launch scripts. We have developed a community-extensible toolkit to compute fine-grained compute, memory, and communication efficiency metrics from this evidence. Using CCL-Bench, we surface three claims that summary-statistic benchmarks cannot support: (i) higher compute-communication overlap can coincide with longer training step time and reveal inefficient parallelization choices, (ii) doubling TPU interconnect bandwidth yields a much higher end-to-end improvement in step time than doubling GPU interconnect bandwidth on small and medium workloads, and (iii) the best-tuned configuration on one training framework can run up to 3× slower than the best-tuned configuration on a peer framework on identical hardware.
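To make claim (i) concrete, here is a minimal sketch of how a compute-communication overlap metric could be derived from trace intervals. It is not CCL-Bench's toolkit: the function names, the (start, end) interval representation, and the two toy configurations are all assumptions chosen for illustration, and the numbers are made up rather than taken from the paper.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # hypothetical (start_ms, end_ms) of one kernel


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals so busy time is not double-counted."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total(intervals: List[Interval]) -> float:
    """Sum of interval lengths."""
    return sum(end - start for start, end in intervals)


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersection of two merged, sorted interval lists."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def overlap_fraction(compute: List[Interval], comm: List[Interval]) -> float:
    """Fraction of communication time hidden behind compute (assumed definition)."""
    comm_merged = merge(comm)
    comm_time = total(comm_merged)
    if comm_time == 0:
        return 0.0
    return total(intersect(merge(compute), comm_merged)) / comm_time


def step_time(compute: List[Interval], comm: List[Interval]) -> float:
    """Wall-clock span from the first kernel start to the last kernel end."""
    everything = compute + comm
    return max(e for _, e in everything) - min(s for s, _ in everything)


# Two made-up configurations: A hides all communication, B hides only a third.
config_a = {"compute": [(0.0, 10.0)], "comm": [(2.0, 10.0)]}
config_b = {"compute": [(0.0, 6.0)], "comm": [(5.0, 8.0)]}

overlap_a = overlap_fraction(config_a["compute"], config_a["comm"])
overlap_b = overlap_fraction(config_b["compute"], config_b["comm"])
step_a = step_time(config_a["compute"], config_a["comm"])
step_b = step_time(config_b["compute"], config_b["comm"])
```

In this toy case configuration A has the higher overlap fraction yet the longer step, because its parallelization choice incurs more communication in total. An overlap number reported without the underlying trace would not reveal that.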
Problem

Research questions and friction points this paper is trying to address.

LLM infrastructure
benchmarking
trace-based evaluation
performance analysis
configuration space
Innovation

Methods, ideas, or system contributions that make the work stand out.

trace-based benchmark
LLM infrastructure
execution trace
fine-grained metrics
compute-communication overlap