🤖 AI Summary
The efficacy and energy efficiency of FP8 for large language model (LLM) inference remain poorly characterized across diverse accelerator architectures.
Method: We present the first end-to-end, hardware-based empirical evaluation framework for FP8, deployed on NVIDIA H100 and Intel Gaudi 2 accelerators. The framework enables joint measurement of accuracy, throughput, and power consumption, with fine-grained operator-level analysis of the E4M3 and E5M2 formats, including their scaling and accumulation behavior.
Contribution/Results: We demonstrate that FP8 is fundamentally a quantization scheme rather than a standard floating-point representation. Gaudi 2 achieves significantly higher throughput per watt under FP8, demonstrating superior energy efficiency for datacenter-scale LLM serving. Our study establishes critical empirical benchmarks for hardware-aware compiler optimization, system-level FP8 deployment, and future FP8 standardization efforts.
📝 Abstract
The introduction of 8-bit floating-point (FP8) computation units in modern AI accelerators has generated significant interest in FP8-based large language model (LLM) inference. Unlike 16-bit floating-point formats, FP8 in deep learning requires a shared scaling factor. Additionally, while E4M3 and E5M2 are well-defined at the individual value level, their scaling and accumulation methods remain unspecified and vary across hardware and software implementations. As a result, FP8 behaves more like a quantization format than a standard numeric representation. In this work, we provide the first comprehensive analysis of FP8 computation and acceleration on two AI accelerators: the NVIDIA H100 and Intel Gaudi 2. Our findings highlight that the Gaudi 2, by leveraging FP8, achieves higher throughput-to-power efficiency during LLM inference, offering valuable insights into the practical implications of FP8 adoption for datacenter-scale LLM serving.
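To illustrate why FP8 behaves like a quantization format, the sketch below simulates per-tensor scaled E4M3 rounding in pure Python. The function names, the saturating clamp at the E4M3 maximum, and round-to-nearest tie-breaking are illustrative assumptions for a software simulation, not the exact behavior of either accelerator evaluated here:

```python
import math

E4M3_MAX = 448.0  # largest finite E4M3 ("fn" variant) value

def round_to_e4m3(x: float) -> float:
    """Round a float to the nearest representable E4M3 value.

    Assumes saturation (clamp) on overflow rather than mapping to NaN;
    real hardware may differ.
    """
    x = max(-E4M3_MAX, min(E4M3_MAX, x))
    m, e = math.frexp(x)       # x = m * 2**e with 0.5 <= |m| < 1
    e = max(e, -5)             # below 2**-6, values fall on the subnormal grid
    step = 2.0 ** (e - 4)      # 3 stored mantissa bits -> 4 significant bits
    return round(x / step) * step

def fp8_quantize(xs):
    """Per-tensor scaling: a single shared scale maps amax onto E4M3_MAX.

    This is the 'shared scaling factor' FP8 needs; per-channel or delayed
    scaling are common alternatives, and the choice is implementation-defined.
    """
    scale = max(abs(v) for v in xs) / E4M3_MAX or 1.0  # avoid zero scale
    return [round_to_e4m3(v / scale) for v in xs], scale
```

Because a value like 1.2 snaps to 1.25 (the nearest of the 256 E4M3 code points after scaling), round-trip error depends on both the format and the chosen scaling policy, which is exactly what makes FP8 a quantization scheme rather than a plain numeric type.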