Faster Inference of LLMs using FP8 on the Intel Gaudi

📅 2025-03-13

📈 Citations: 0

✨ Influential: 0

career value

214K/year

🤖 AI Summary

This work addresses the lack of systematic analysis of FP8 low-precision inference deployment on Intel Gaudi AI accelerators. We present the first comprehensive characterization of FP8 quantization implementation on the Gaudi2 platform, establishing both operator-level and end-to-end performance models, and proposing hardware-aware optimization techniques. By deeply tuning Matrix Fabric Unit (MFU) utilization and co-designing quantization strategies, we achieve >90% MFU utilization across mainstream large language models, significantly improving inference throughput while maintaining end-to-end accuracy degradation below 1%. Our approach thus achieves a practical balance between high throughput and minimal precision loss. This study fills a critical gap in empirical understanding and mechanistic analysis of FP8 inference on commercial AI accelerators, providing a reproducible and transferable methodology for low-precision LLM inference.

Technology Category

Application Category

📝 Abstract

Low-precision data types are essential in modern neural networks during both training and inference as they enhance throughput and computational capacity by better exploiting available hardware resources. Despite the incorporation of FP8 in commercially available neural network accelerators, a comprehensive exposition of its underlying mechanisms, along with rigorous performance and accuracy evaluations, is still lacking. In this work, we contribute in three significant ways. First, we analyze the implementation details and quantization options associated with FP8 for inference on the Intel Gaudi AI accelerator. Second, we empirically quantify the throughput improvements afforded by the use of FP8 at both the operator level and in end-to-end scenarios. Third, we assess the accuracy impact of various FP8 quantization methods. Our experimental results indicate that the Intel Gaudi 2 accelerator consistently achieves high computational unit utilization, frequently exceeding 90% MFU, while incurring an accuracy degradation of less than 1%.

Problem

Research questions and friction points this paper is trying to address.

Analyze FP8 implementation on Intel Gaudi for inference.

Quantify throughput improvements using FP8 in AI tasks.

Assess accuracy impact of FP8 quantization methods.

Innovation

Methods, ideas, or system contributions that make the work stand out.

FP8 implementation on Intel Gaudi AI accelerator

Throughput improvements quantified at operator level

Accuracy impact assessed for FP8 quantization methods

🔎 Similar Papers

No similar papers found.