🤖 AI Summary
Despite growing interest in leveraging large language models (LLMs) for scientific computing, there remains a lack of systematic, reproducible evaluation of their ability to generate correct, robust, and scalable high-performance computing (HPC) code.
Method: This work introduces the first multidimensional LLM code quality assessment framework tailored to scientific computing, uniformly evaluating syntactic correctness, runtime robustness, and strong/weak scalability. We empirically validate generated implementations of the Mandelbrot set written in modern C++ (C++17/20), OpenMP 5.0, and MPI-3.1, compiled with GCC 11.5.0 and executed on a multi-node HPC cluster.
Contribution/Results: ChatGPT-4 and -5 achieve >92% syntactic accuracy and demonstrate near-linear weak scaling up to 64 nodes, delivering a peak equivalent computational throughput of 52.3 TFLOPS. This study fills a critical gap by establishing the first comprehensive, reproducible benchmark and evaluation paradigm for LLM-generated parallel HPC code, enabling rigorous assessment of LLMs in scientific computing contexts.
📝 Abstract
Parallel programming remains one of the most challenging aspects of High-Performance Computing (HPC), requiring deep knowledge of synchronization, communication, and memory models. While modern C++ standards and frameworks such as OpenMP and MPI have simplified parallelism, mastering these paradigms is still complex. Recently, Large Language Models (LLMs) have shown promise in automating code generation, but their effectiveness in producing correct and efficient HPC code is not well understood. In this work, we systematically evaluate leading LLMs, including ChatGPT-4, ChatGPT-5, Claude, and LLaMA, on the task of generating C++ implementations of the Mandelbrot set using shared-memory, directive-based, and distributed-memory paradigms. Each generated program is compiled with GCC 11.5.0 and executed to assess its correctness, robustness, and scalability. Results show that ChatGPT-4 and ChatGPT-5 achieve strong syntactic precision and scalable performance.