Instant GPU Efficiency Visibility at Fleet Scale

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches struggle to measure hardware-level compute efficiency in large-scale AI training in real time, without instrumenting applications, and consistently across GPU generations and numeric precisions. This work proposes Overall FLOP Utilization (OFU), a metric derived from only two on-chip performance counters, Tensor Pipe activity and SM clock frequency, which requires no application instrumentation and applies across GPU generations and precisions. Controlled GEMM experiments on H100 and GB200 (FP16, TF32, FP8, and NVFP4) characterize five properties of the approximation: tile quantization, floating-point precision scaling, clock sampling noise, Tensor Core clock domains, and non-tensor undercounting. After tile-quantization correction, OFU predicts application-level Model FLOP Utilization (MFU) to within 2 percentage points; across 608 production training jobs it correlates with MFU at r = 0.78. In production, OFU has surfaced two framework-level FLOPs miscalculations, detected a 2.5× efficiency regression, and tracked precision-dependent utilization changes during mixed-precision pretraining.

📝 Abstract

We present Overall FLOP Utilization (OFU), a hardware-level, precision-agnostic GPU efficiency metric for AI workloads on HPC systems, derived from two on-chip performance counters: Tensor Pipe Activity and SM clock frequency. OFU requires no application instrumentation and works across GPU generations and numeric precisions. We characterize five properties of the OFU approximation (tile quantization, floating-point precision scaling, clock sampling noise, Tensor Core clock domains, and non-tensor undercounting) through controlled GEMM experiments on H100 and GB200 across FP16, TF32, FP8, and NVFP4. After tile-quantization correction, OFU predicts application-level MFU to within 2 percentage points. Against 608 production training jobs, OFU achieves r = 0.78 correlation with application-level MFU and surfaces two framework-level FLOPs miscalculations. Deployed across large-scale GPU fleets, OFU has detected a 2.5x efficiency regression and tracked precision-dependent utilization changes in mixed-precision pretraining. Our evaluation and operational experience suggest OFU is a practical, deployment-ready complement to application-level MFU for continuous fleet-wide efficiency monitoring.
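
The abstract does not state the OFU formula, so the following is only a rough sketch of how such a counter-based estimate might be assembled. It assumes that OFU can be approximated as the tensor-pipe active fraction scaled by the ratio of the observed SM clock to the maximum SM clock, and that tile quantization can be corrected by the fraction of padded-tile work that is actually useful. The function names, the 128 x 128 tile size, and the direction of the correction are all illustrative assumptions, not the authors' method.

```python
import math


def ofu_estimate(tensor_active: float, sm_clock_mhz: float, max_sm_clock_mhz: float) -> float:
    """Assumed OFU approximation (not taken from the paper).

    tensor_active     fraction of cycles the tensor pipe was busy, in [0, 1]
    sm_clock_mhz      sampled SM clock over the same interval
    max_sm_clock_mhz  clock at which the datasheet peak FLOP rate is quoted
    """
    if not 0.0 <= tensor_active <= 1.0:
        raise ValueError("tensor_active must be a fraction in [0, 1]")
    if sm_clock_mhz < 0 or max_sm_clock_mhz <= 0:
        raise ValueError("clock values must be positive")
    # A busy tensor pipe delivers fewer FLOPs per second when the clock is throttled.
    return tensor_active * (sm_clock_mhz / max_sm_clock_mhz)


def tile_efficiency(m: int, n: int, tile_m: int = 128, tile_n: int = 128) -> float:
    """Useful fraction of an M x N GEMM output when work is issued in whole tiles.

    The 128 x 128 default is a hypothetical tile shape; real kernels choose
    their own tile sizes.
    """
    padded_m = math.ceil(m / tile_m) * tile_m
    padded_n = math.ceil(n / tile_n) * tile_n
    return (m * n) / (padded_m * padded_n)


def corrected_ofu(raw_ofu: float, m: int, n: int, tile_m: int = 128, tile_n: int = 128) -> float:
    """Scale raw OFU by tile efficiency so padded-tile work is not counted as useful.

    Assumption: the hardware counter sees padded tiles as active cycles, so the
    raw value overstates useful FLOPs relative to application-level MFU.
    """
    return raw_ofu * tile_efficiency(m, n, tile_m, tile_n)


if __name__ == "__main__":
    raw = ofu_estimate(tensor_active=0.5, sm_clock_mhz=1500, max_sm_clock_mhz=2000)
    print(f"raw OFU:       {raw:.3f}")
    print(f"corrected OFU: {corrected_ofu(raw, m=129, n=128):.3f}")
```

In this sketch, a GEMM whose M dimension is 129 spills one row into a second 128-row tile, so roughly half of the counted tensor-pipe work is padding. That is the kind of gap between a hardware counter and application-level MFU that a tile-quantization correction is meant to close.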

Problem

Research questions and friction points this paper is trying to address.

GPU efficiency

fleet-scale monitoring

FLOP utilization

AI workloads

performance counters

Innovation

Methods, ideas, or system contributions that make the work stand out.

Overall FLOP Utilization

GPU efficiency

performance counters