Sustaining Exascale Performance: Lessons from HPL and HPL-MxP on Aurora

📅 2026-04-10

🤖 AI Summary

This work addresses the challenge of sustaining exascale performance on heterogeneous supercomputers under real deployment constraints. Drawing on three successive HPL and HPL-MxP campaigns on Aurora, the study identifies the system-level choices that sustained performance at production scale: deterministic locality-aware resource mapping, explicit CPU-GPU pipelining, mixed-precision orchestration, and a hybrid point-to-point/collective resilience strategy introduced after synchronization stalls appeared at scale. Aurora combines Intel discrete GPUs, CPU-attached network interfaces, and the Slingshot-11 interconnect. FP64 HPL performance rose from 0.585 EF/s on 5,439 nodes to 1.01 EF/s on 9,234 nodes, and HPL-MxP reached 11.64 EF/s, an 11.5× speedup over FP64 enabled by mixed-precision arithmetic and Intel AMX acceleration. The authors expect the broader lessons to carry over to other tightly coupled heterogeneous systems at extreme scale.

📝 Abstract

Sustaining exascale performance in production requires engineering choices and operational practices that emerge only under real deployment constraints and demand coordination across system layers. This paper reports experience from three successive campaigns running HPL and HPL-MxP on Aurora, an Intel-based exascale system featuring the first large-scale deployment of Intel discrete GPUs, CPU-attached network interfaces, and the largest production Slingshot-11 interconnect. Aurora progressed from 0.585 EF/s on 5,439 nodes to 1.01 EF/s on 9,234 nodes in FP64 HPL, while HPL-MxP reached 11.64 EF/s, an 11.5x speedup over FP64 enabled by mixed-precision arithmetic and Intel AMX acceleration. We identify the system-level choices that sustained these results at production scale and classify them by role, including deterministic locality-aware resource mapping, explicit CPU-GPU pipelining, mixed-precision orchestration, and a hybrid P2P/collective resilience strategy introduced after synchronization stalls at scale. While some observations are Aurora-specific, the broader lessons are likely to apply to tightly coupled heterogeneous systems at extreme scale.
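
As a quick sanity check on the figures quoted in the abstract, the short Python sketch below derives the per-node FP64 rate for the two HPL runs and the HPL-MxP speedup. Only the EF/s values and node counts come from the abstract; the function and variable names are illustrative. The abstract does not state the node count of the HPL-MxP run, so no per-node figure is computed for it.

```python
# Figures taken from the abstract: (sustained EF/s, node count) for FP64 HPL.
HPL_FIRST = (0.585, 5439)
HPL_LATEST = (1.01, 9234)
HPL_MXP_EFLOPS = 11.64  # node count not given in the abstract


def per_node_tflops(eflops: float, nodes: int) -> float:
    """Convert a system-wide rate in EF/s to an average TF/s per node."""
    return eflops * 1e18 / nodes / 1e12


first_rate = per_node_tflops(*HPL_FIRST)
latest_rate = per_node_tflops(*HPL_LATEST)
mxp_speedup = HPL_MXP_EFLOPS / HPL_LATEST[0]
node_growth = HPL_LATEST[1] / HPL_FIRST[1]
perf_growth = HPL_LATEST[0] / HPL_FIRST[0]

print(f"FP64 per node, first run:  {first_rate:.1f} TF/s")
print(f"FP64 per node, latest run: {latest_rate:.1f} TF/s")
print(f"Node count growth:         {node_growth:.2f}x")
print(f"FP64 performance growth:   {perf_growth:.2f}x")
print(f"HPL-MxP over FP64 HPL:     {mxp_speedup:.2f}x")
```

By this arithmetic the per-node FP64 rate is nearly flat across the two runs (about 107.6 versus 109.4 TF/s), so most of the FP64 gain tracks the larger node count, and the 11.64 / 1.01 ratio reproduces the quoted 11.5x.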

Problem

Research questions and friction points this paper is trying to address.

exascale performance

heterogeneous systems

system-level coordination

mixed-precision computing

large-scale deployment

Innovation

Methods, ideas, or system contributions that make the work stand out.

mixed-precision

CPU-GPU pipelining

locality-aware resource mapping
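
To make the mixed-precision item above concrete: the HPL-MxP benchmark, as generally described, factors the matrix in low precision and then refines the solution until it reaches FP64 accuracy. The Python sketch below illustrates that idea on a tiny system. It rests on several assumptions: FP32 (emulated with the standard `struct` module) stands in for the lower precisions used on accelerators, classic iterative refinement stands in for the GMRES-based refinement the benchmark describes, and every function name is hypothetical. It is not Aurora's implementation.

```python
import struct


def to_fp32(x: float) -> float:
    """Round an FP64 Python float to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_factor_fp32(a):
    """LU factorization with partial pivoting, rounding every result to FP32."""
    n = len(a)
    lu = [[to_fp32(v) for v in row] for row in a]
    piv = list(range(n))  # row i of LU corresponds to original row piv[i]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        lu[k], lu[p] = lu[p], lu[k]
        piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            lu[i][k] = to_fp32(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = to_fp32(lu[i][j] - to_fp32(lu[i][k] * lu[k][j]))
    return lu, piv


def lu_solve(lu, piv, b):
    """Solve with the stored factors: forward then backward substitution."""
    n = len(b)
    y = [b[piv[i]] for i in range(n)]
    for i in range(n):  # unit lower-triangular solve
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    for i in reversed(range(n)):  # upper-triangular solve
        for j in range(i + 1, n):
            y[i] -= lu[i][j] * y[j]
        y[i] /= lu[i][i]
    return y


def solve_mixed(a, b, iters=5):
    """Low-precision factorization plus FP64 residual correction steps."""
    n = len(b)
    lu, piv = lu_factor_fp32(a)
    x = lu_solve(lu, piv, b)
    for _ in range(iters):
        # Residual is computed in FP64 against the original matrix.
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = lu_solve(lu, piv, r)  # reuse the cheap low-precision factors
        x = [xi + di for xi, di in zip(x, d)]
    return x


def max_error(x, x_ref):
    return max(abs(u - v) for u, v in zip(x, x_ref))


# Small, well-conditioned test system with a known solution.
a = [[4.1, 1.3, 0.7], [1.2, 5.3, 0.9], [0.6, 1.1, 6.7]]
x_true = [1.0, -2.0, 3.0]
b = [sum(a[i][j] * x_true[j] for j in range(3)) for i in range(3)]

err_low = max_error(solve_mixed(a, b, iters=0), x_true)
err_refined = max_error(solve_mixed(a, b, iters=5), x_true)
print(f"error without refinement: {err_low:.3e}")
print(f"error after refinement:   {err_refined:.3e}")
```

The design point is that the expensive factorization runs entirely in the cheaper precision, while each refinement step costs only a matrix-vector product and two triangular solves. This is why a mixed-precision benchmark can report a much higher rate than FP64 HPL while still delivering an FP64-quality solution.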