🤖 AI Summary
Despite growing interest in leveraging large language models (LLMs) for scientific computing, there remains a lack of systematic, reproducible evaluation of their ability to generate correct, robust, and scalable high-performance computing (HPC) code.
Method: This work introduces the first multidimensional LLM code quality assessment framework tailored to scientific computing, uniformly evaluating syntactic correctness, runtime robustness, and strong/weak scalability. We empirically validate generated implementations of the Mandelbrot set written in modern C++ (C++17/20), OpenMP 5.0, and MPI-3.1, compiled with GCC 11.5.0 and executed on a multi-node HPC cluster.
Contribution/Results: ChatGPT-4 and -5 achieve >92% syntactic accuracy and demonstrate near-linear weak scaling up to 64 nodes, delivering a peak equivalent computational throughput of 52.3 TFLOPS. This study fills a critical gap by establishing the first comprehensive, reproducible benchmark and evaluation paradigm for LLM-generated parallel HPC code, enabling rigorous assessment of LLMs in scientific computing contexts.
📝 Abstract
Parallel programming remains one of the most challenging aspects of High-Performance Computing (HPC), requiring deep knowledge of synchronization, communication, and memory models. While modern C++ standards and frameworks such as OpenMP and MPI have simplified parallelism, mastering these paradigms is still complex. Recently, Large Language Models (LLMs) have shown promise in automating code generation, but their effectiveness in producing correct and efficient HPC code is not well understood. In this work, we systematically evaluate leading LLMs, including ChatGPT-4, ChatGPT-5, Claude, and LLaMA, on the task of generating C++ implementations of the Mandelbrot set using shared-memory, directive-based, and distributed-memory paradigms. Each generated program is compiled with GCC 11.5.0 and executed to assess its correctness, robustness, and scalability. Results show that ChatGPT-4 and ChatGPT-5 achieve strong syntactic precision and scalable performance.