ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

📅 2025-06-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work evaluates how effectively large language models (LLMs) can automatically port entire scientific computing and AI codebases across heterogeneous GPGPU programming models (e.g., CUDA, HIP, SYCL). To this end, the authors introduce ParEval-Repo, a repository-level HPC code translation benchmark framework supporting multiple programming models and levels of repository complexity, and use it to contrast non-agentic and top-down agentic translation strategies. Using both open-source and commercial LLMs, they systematically assess generated code for compilability, functional correctness, build-error distribution, and inference-token cost. Results indicate that while LLMs achieve viable translation accuracy on small scientific kernels, they exhibit critical bottlenecks in generating robust build systems, resolving cross-file dependencies, and scaling to large, real-world codebases, highlighting fundamental limitations in structural understanding and system-level reasoning.
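At its simplest, cross-model translation is an API rename, the kind of textual rewrite tools like AMD's hipify-perl perform. The Python sketch below (illustrative only, not from the paper; the mapping table is a small hand-picked subset) shows that mechanical layer; the paper's finding is that real repositories additionally demand build-system generation and cross-file reasoning far beyond it.

```python
import re

# Illustrative subset of the CUDA -> HIP runtime API mapping (the kind of
# textual rewrite hipify-perl performs). Real repository-level translation
# must also handle build systems and cross-file dependencies.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
}

def translate_cuda_to_hip(source: str) -> str:
    """Naive token-level CUDA-to-HIP rewrite (illustration only)."""
    # Word boundaries (\b) keep short names like cudaMemcpy from
    # matching inside longer ones like cudaMemcpyHostToDevice.
    pattern = re.compile(r"\b(" + "|".join(CUDA_TO_HIP) + r")\b")
    return pattern.sub(lambda m: CUDA_TO_HIP[m.group(1)], source)
```

A single-file rewrite like this says nothing about whether the surrounding CMake files, headers, and link steps still produce a working build, which is exactly where the paper reports LLMs breaking down.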

📝 Abstract
GPGPU architectures have become significantly diverse in recent years, which has led to an emergence of a variety of specialized programming models and software stacks to support them. While portable execution models exist, they still require significant developer effort to port to and optimize for different hardware architectures. Recent advances in large language models (LLMs) can help us reduce some of this programmer burden. In this paper, we present a novel benchmark and testing framework, ParEval-Repo, which can be used to evaluate the efficacy of LLM-based approaches in automatically translating entire codebases across GPGPU execution models. ParEval-Repo includes several scientific computing and AI mini-applications in a range of programming models and levels of repository complexity. We use ParEval-Repo to evaluate a range of state-of-the-art open-source and commercial LLMs, with both a non-agentic and a top-down agentic approach. We assess code generated by the LLMs and approaches in terms of compilability, functional correctness, categories of build errors, and the cost of translation in terms of the number of inference tokens. Our results demonstrate that LLM translation of scientific applications is feasible for small programs, but difficulties in generating functional build systems and resolving cross-file dependencies pose challenges in scaling to larger codebases.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for GPGPU code translation across diverse architectures
Assessing LLM performance in repository-level HPC code porting
Measuring translation feasibility for scientific apps with build challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark suite for LLM HPC translation tasks
Evaluates LLMs on multi-file codebase translation
Tests compilability and functional correctness of outputs