🤖 AI Summary
This study addresses the lack of systematic evaluation of open-source large language models (LLMs) for multilingual high-performance computing (HPC) code generation. Method: We conduct the first comprehensive assessment of DeepSeek’s capabilities across four representative HPC kernels—conjugate gradient solvers, parallel heat equation solvers, DGEMM, and STREAM triad—in C++, Fortran, Julia, and Python. Code correctness, execution performance, and strong/weak scaling behavior are rigorously evaluated using MPI/OpenMP implementations, multi-scale problem sizes, and profiling via perf, time, and scaling analysis; results are benchmarked against GPT-4. Contribution/Results: DeepSeek generates syntactically correct and functionally viable HPC code across languages. However, it exhibits significantly inferior performance and scalability—particularly under large-scale parallel configurations and large-matrix workloads—compared to GPT-4. This reveals critical capability gaps in current open-source LLMs for production-grade HPC code synthesis, highlighting bottlenecks in parallel algorithm understanding, hardware-aware optimization, and scalability reasoning.
📝 Abstract
Large Language Models (LLMs), such as GPT-4 and DeepSeek, have been applied to a wide range of domains in software engineering. However, their potential in the context of High-Performance Computing (HPC) remains largely unexplored. This paper evaluates how well DeepSeek, a recent LLM, performs in generating a set of HPC benchmark codes: a conjugate gradient solver, a parallel heat equation solver, parallel matrix multiplication (DGEMM), and the STREAM triad operation. We analyze DeepSeek's code generation capabilities in traditional HPC languages such as C++ and Fortran, as well as in Julia and Python. The evaluation covers code correctness, performance, and scaling across different configurations and matrix sizes. We also provide a detailed comparison between DeepSeek and another widely used tool, GPT-4. Our results show that while DeepSeek generates functional code for HPC tasks, it lags behind GPT-4 in the scalability and execution efficiency of the generated code.
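To make the benchmark suite concrete, the simplest of the four kernels, the STREAM triad, computes `a[i] = b[i] + scalar * c[i]` over large arrays and is memory-bandwidth bound. A minimal sketch in Python/NumPy is shown below; the array size and scalar are illustrative choices, not the configurations used in the paper's experiments.

```python
import time

import numpy as np

# STREAM "triad" kernel: a = b + scalar * c, memory-bandwidth bound.
# N and scalar are illustrative, not the problem sizes from the study.
N = 10_000_000
scalar = 3.0
b = np.ones(N)
c = np.full(N, 2.0)

t0 = time.perf_counter()
a = b + scalar * c          # the triad itself
t1 = time.perf_counter()

# Three arrays of doubles are streamed: two reads plus one write.
gbytes = 3 * N * 8 / 1e9
print(f"triad: {gbytes / (t1 - t0):.2f} GB/s")
```

The same kernel written in C++ or Fortran with an OpenMP `parallel for` over the loop is what the LLM-generated codes in the evaluation are expected to produce; the reported GB/s figure is the usual STREAM metric for comparing such implementations.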