🤖 AI Summary
This study evaluates how emerging AI accelerators challenge GPU dominance in performance, energy efficiency, and practical usability. It combines end-to-end workloads with microbenchmarks of individual computational primitives, and measures latency, throughput, power consumption, and the energy cost of communication during LLM inference, with separate power measurements for the prefill and decode phases. The platforms covered are the Cerebras CS-3, SambaNova SN-40, Groq, Gaudi, and TPUv5e, compared against NVIDIA (A100, H100) and AMD (MI-300X) GPUs. The optimal platform varies with batch size, sequence length, and model size, so hardware selection is highly workload-dependent. Cerebras, SambaNova, and Gaudi show 10–60% higher idle power than the NVIDIA and AMD GPUs, which means high utilization is needed to realize their promised efficiency gains. The study also assesses programmability by comparing the compilation times and software stack maturity required to reach promised performance.
📝 Abstract
The push for greater efficiency in AI computation has given rise to an array of accelerator architectures that increasingly challenge the GPU's long-standing dominance. In this work, we provide a quantitative view of this evolving landscape of AI accelerators, including the Cerebras CS-3, SambaNova SN-40, Groq, Gaudi, and TPUv5e platforms, and compare them against both NVIDIA (A100, H100) and AMD (MI-300X) GPUs. We evaluate key trade-offs in latency, throughput, power consumption, and energy efficiency across both (i) end-to-end workloads and (ii) benchmarks of individual computational primitives. Notably, we find that the optimal hardware platform varies with batch size, sequence length, and model size, revealing a large underlying optimization space. Our analysis includes detailed power measurements across the prefill and decode phases of LLM inference, as well as quantification of the energy cost of communication. We additionally find that Cerebras, SambaNova, and Gaudi have 10–60% higher idle power than NVIDIA and AMD GPUs, emphasizing the importance of high utilization to realize promised efficiency gains. Finally, we assess programmability across platforms based on our experiments with real profiled workloads, comparing the compilation times and software stack maturity required to achieve promised performance.
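To illustrate why higher idle power makes utilization matter, the following is a minimal sketch of a two-state power model: a device draws its active power while busy and its idle power otherwise. The model, the function name `energy_per_token`, and every number in it are assumptions chosen for illustration; none of them are measurements or methods from the paper.

```python
def energy_per_token(p_idle_w, p_active_w, peak_tokens_per_s, utilization):
    """Joules per token for a device that is busy a fraction `utilization` of the time.

    Assumed two-state model: active power while busy, idle power otherwise.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    avg_power_w = utilization * p_active_w + (1.0 - utilization) * p_idle_w
    tokens_per_s = utilization * peak_tokens_per_s
    return avg_power_w / tokens_per_s


# Hypothetical devices (illustrative numbers, not results from the paper).
baseline = dict(p_idle_w=100.0, p_active_w=700.0, peak_tokens_per_s=1000.0)
# Assumed: 60% higher idle power, but higher throughput when busy.
accelerator = dict(p_idle_w=160.0, p_active_w=700.0, peak_tokens_per_s=1200.0)

for u in (0.1, 0.25, 0.5, 1.0):
    e_base = energy_per_token(utilization=u, **baseline)
    e_acc = energy_per_token(utilization=u, **accelerator)
    print(f"utilization={u:.2f}  baseline={e_base:.3f} J/tok  accelerator={e_acc:.3f} J/tok")
```

Under these assumed numbers, the hypothetical accelerator uses less energy per token at full utilization but more at 10% utilization, with a break-even point near 22% utilization. The specific crossover depends entirely on the chosen values; the sketch only shows the shape of the trade-off.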