A Unifying Framework to Enable Artificial Intelligence in High Performance Computing Workflows

📅 2025-05-05
🤖 AI Summary
Scientific computing at the convergence of HPC and AI faces fragmented toolchains, poor hardware portability, and inefficient coordination across the two paradigms. To address these challenges, this paper introduces what it describes as the first PyTorch-level unified abstraction framework enabling tight HPC/AI co-design. The approach features a hardware-agnostic operator registration mechanism, a dynamic workflow orchestration engine, and an automatic mixed-precision scheduler. Implemented as a C++/Python hybrid architecture, it integrates MPI+NCCL communication, ONNX Runtime extensions, an adaptive graph compiler, and a declarative task-graph DSL. Evaluated on leadership-class supercomputers, including Eagle and Perlmutter, the framework achieves a 3.2× throughput improvement in AI training and reduces end-to-end latency by 67% for coupled HPC simulation and ML inference. It has been deployed in production scientific workloads, including climate modeling and plasma simulation.
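As an illustration of what a hardware-agnostic operator registration mechanism can look like, here is a minimal sketch in plain Python. It is not the framework's API: the names `register_op`, `dispatch`, the `axpy` operator, and the `fake_accel` backend label are all hypothetical, chosen only to show the idea of registering one implementation per backend and falling back to a default when a backend has none.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry: maps (operator name, backend name) to an implementation.
_REGISTRY: Dict[Tuple[str, str], Callable] = {}


def register_op(name: str, backend: str) -> Callable:
    """Decorator that records an implementation of `name` for `backend`."""
    def decorator(fn: Callable) -> Callable:
        _REGISTRY[(name, backend)] = fn
        return fn
    return decorator


def dispatch(name: str, backend: str, *args, fallback: str = "cpu"):
    """Call the implementation for `backend`, or the fallback backend if absent."""
    fn = _REGISTRY.get((name, backend))
    if fn is None:
        fn = _REGISTRY.get((name, fallback))
    if fn is None:
        raise KeyError(f"no implementation of {name!r} for {backend!r} or {fallback!r}")
    return fn(*args)


@register_op("axpy", "cpu")
def axpy_cpu(a, x, y):
    # Reference implementation: a * x + y, element by element.
    return [a * xi + yi for xi, yi in zip(x, y)]


@register_op("axpy", "fake_accel")
def axpy_accel(a, x, y):
    # Stand-in for a vendor kernel; a real backend would call into a device library.
    return [a * xi + yi for xi, yi in zip(x, y)]


result = dispatch("axpy", "fake_accel", 2.0, [1.0, 2.0], [3.0, 4.0])
```

Calling code names only the operator and the target backend, so supporting new hardware means registering another implementation rather than changing call sites.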

📝 Abstract
Current trends point to a future where large-scale scientific applications are tightly coupled HPC/AI hybrids. Hence, we urgently need to invest in creating a seamless, scalable framework where HPC and AI/ML can work together efficiently and adapt to novel hardware and vendor libraries without starting from scratch every few years. The current ecosystem and sparsely connected community are not sufficient to tackle these challenges, and we require a breakthrough catalyst for science similar to what PyTorch enabled for AI.

Problem

Research questions and friction points this paper is trying to address.

Creating a seamless, scalable framework for hybrid HPC/AI applications
Enabling efficient adaptation to novel hardware and vendor libraries
Uniting a sparsely connected community around scalable solutions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifying HPC and AI in a single scalable framework
Adapting efficiently to novel hardware
Leveraging vendor libraries without starting from scratch every few years