An Investigation of FP8 Across Accelerators for LLM Inference

📅 2025-02-03
🤖 AI Summary
FP8’s cross-hardware efficacy and energy efficiency for large language model (LLM) inference remain poorly characterized, particularly across diverse accelerator architectures. Method: We present the first end-to-end, hardware-based empirical evaluation framework for FP8, deployed on NVIDIA H100 and Intel Gaudi 2 accelerators. The framework enables joint measurement of accuracy, throughput, and power consumption, with fine-grained operator-level analysis—including scaling and accumulation—of E4M3 and E5M2 formats. Contribution/Results: We demonstrate that FP8 is fundamentally a quantization scheme rather than a standard floating-point representation. Gaudi 2 achieves significantly higher throughput-per-watt under FP8, confirming its superior energy efficiency for datacenter-scale LLM serving. Our study establishes critical empirical benchmarks for hardware-aware compiler optimization, system-level FP8 deployment, and future FP8 standardization efforts.

📝 Abstract
The introduction of 8-bit floating-point (FP8) computation units in modern AI accelerators has generated significant interest in FP8-based large language model (LLM) inference. Unlike 16-bit floating-point formats, FP8 in deep learning requires a shared scaling factor. Additionally, while E4M3 and E5M2 are well-defined at the individual value level, their scaling and accumulation methods remain unspecified and vary across hardware and software implementations. As a result, FP8 behaves more like a quantization format than a standard numeric representation. In this work, we provide the first comprehensive analysis of FP8 computation and acceleration on two AI accelerators: the NVIDIA H100 and Intel Gaudi 2. Our findings highlight that the Gaudi 2, by leveraging FP8, achieves higher throughput-to-power efficiency during LLM inference, offering valuable insights into the practical implications of FP8 adoption for datacenter-scale LLM serving.
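The abstract's core observation is that FP8 needs a shared (per-tensor) scaling factor: values are divided by a scale so the tensor's dynamic range fits within E4M3's representable range, then rounded to the 8-bit grid. A minimal sketch of that idea, in plain Python, is below. It assumes the OCP FP8 E4M3 parameters (max normal 448, bias 7, 3 mantissa bits) and a simple saturate-on-overflow policy; the function names and per-tensor amax scaling recipe are illustrative, not the paper's exact implementation.

```python
import math

E4M3_MAX = 448.0                # largest normal E4M3 value (OCP FP8 spec)
E4M3_MIN_NORMAL = 2.0 ** -6     # smallest normal magnitude
E4M3_SUBNORMAL_STEP = 2.0 ** -9 # spacing in the subnormal range

def round_to_e4m3(x: float) -> float:
    """Round x to the nearest representable E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)            # saturate instead of producing NaN/inf
    if a < E4M3_MIN_NORMAL:              # subnormals: uniform step of 2^-9
        return sign * round(a / E4M3_SUBNORMAL_STEP) * E4M3_SUBNORMAL_STEP
    m, e = math.frexp(a)                 # a = m * 2**e with m in [0.5, 1)
    q = round(m * 16) / 16 * 2.0 ** e    # keep 4 significant bits (1 implicit + 3 stored)
    return sign * min(q, E4M3_MAX)

def quantize_fp8_e4m3(tensor):
    """Per-tensor quantization with one shared scale, as the abstract describes."""
    amax = max(abs(v) for v in tensor)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in tensor], scale

def dequantize(q, scale):
    return [v * scale for v in q]
```

Because the scale is shared across the whole tensor (rather than being part of each value, as in FP16/BF16), the round-trip accuracy depends on how that scale is chosen and where accumulation happens — which is exactly the hardware- and software-dependent behavior the paper sets out to measure.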
Problem

Research questions and friction points this paper is trying to address.

FP8 format, large language model inference, AI accelerators
Innovation

Methods, ideas, or system contributions that make the work stand out.

FP8 format, energy efficiency, AI accelerators