🤖 AI Summary
This work evaluates how effectively large language models (LLMs) can automatically port entire scientific computing and AI codebases across heterogeneous GPGPU programming models (e.g., CUDA, HIP, SYCL). To this end, the authors introduce ParEval-Repo, a repository-level HPC code translation benchmark framework supporting multiple programming models and levels of repository complexity, and use it to contrast non-agentic and top-down agentic translation strategies under a standardized evaluation protocol. Using both open-source and commercial LLMs, they systematically assess generated code for compilability, functional correctness, build-error distribution, and inference-token cost. Results indicate that while LLMs achieve viable translation accuracy on small scientific kernels, they exhibit critical bottlenecks in generating working build systems, resolving cross-file dependencies, and scaling to large, real-world codebases, highlighting limitations in structural understanding and system-level reasoning.
📝 Abstract
GPGPU architectures have become significantly diverse in recent years, which has led to the emergence of a variety of specialized programming models and software stacks to support them. While portable execution models exist, they still require significant developer effort to port to and optimize for different hardware architectures. Recent advances in large language models (LLMs) can help reduce some of this programmer burden. In this paper, we present a novel benchmark and testing framework, ParEval-Repo, which can be used to evaluate the efficacy of LLM-based approaches in automatically translating entire codebases across GPGPU execution models. ParEval-Repo includes several scientific computing and AI mini-applications spanning a range of programming models and levels of repository complexity. We use ParEval-Repo to evaluate a range of state-of-the-art open-source and commercial LLMs, with both a non-agentic and a top-down agentic approach. We assess the code generated by each LLM and approach in terms of compilability, functional correctness, categories of build errors, and the cost of translation measured in inference tokens. Our results demonstrate that LLM translation of scientific applications is feasible for small programs, but difficulties in generating functional build systems and resolving cross-file dependencies pose challenges in scaling to larger codebases.
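To make the translation task concrete, here is a minimal sketch (not taken from the paper) of the kind of kernel-level mapping involved in porting CUDA to HIP, one of the programming-model pairs the benchmark covers:

```cuda
// CUDA version of a vector-add kernel.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
// CUDA launch: vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);

// HIP version: the kernel body and built-ins (blockIdx, blockDim,
// threadIdx) carry over unchanged; the header and launch syntax differ.
#include <hip/hip_runtime.h>
__global__ void vecAddHip(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
// HIP launch:
// hipLaunchKernelGGL(vecAddHip, dim3((n + 255) / 256), dim3(256), 0, 0,
//                    a, b, c, n);
```

For isolated kernels like this, the mapping is nearly mechanical, which is consistent with the paper's finding that small programs translate well; the harder, repository-scale part of the task is updating build systems (e.g., CMake toolchain and link settings) and keeping cross-file dependencies consistent across the whole codebase.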