LLM4VV: Evaluating Cutting-Edge LLMs for Generation and Evaluation of Directive-Based Parallel Programming Model Compiler Tests

📅 2025-07-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address trust bottlenecks in large language models (LLMs) for code generation—including hallucination, difficulty in correctness verification, and lack of result interpretability—this paper proposes a generative-discriminative dual-model framework. A generative LLM automatically constructs compiler test cases for instruction-level parallel programming, while a discriminative LLM performs end-to-end verification by integrating formal constraints and logical reasoning. Key innovations include multi-scale model orchestration (7B–70B parameters), structured prompt engineering, automated feedback loops, and a ten-dimensional evaluation metric suite covering functional correctness, boundary robustness, and error detection rate. Experiments demonstrate substantial improvements in test-case quality and generation reliability, outperforming baselines in coverage and defect identification. The framework establishes a novel, interpretable, and formally verifiable paradigm for trustworthy LLM-driven code generation.

Technology Category

Application Category

📝 Abstract
The usage of Large Language Models (LLMs) for software and test development has continued to increase since LLMs were first introduced, but only recently have the expectations of LLMs become more realistic. Verifying the correctness of code generated by LLMs is key to improving their usefulness, but there have been no comprehensive and fully autonomous solutions developed yet. Hallucinations are a major concern when LLMs are applied blindly to problems without taking the time and effort to verify their outputs, and an inability to explain the logical reasoning of LLMs leads to issues with trusting their results. To address these challenges while also aiming to effectively apply LLMs, this paper proposes a dual-LLM system (i.e. a generative LLM and a discriminative LLM) and experiments with the usage of LLMs for the generation of a large volume of compiler tests. We experimented with a number of LLMs possessing varying parameter counts and presented results using ten carefully-chosen metrics that we describe in detail in our narrative. Through our findings, it is evident that LLMs possess the promising potential to generate quality compiler tests and verify them automatically.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for generating parallel programming compiler tests
Addressing hallucinations and trust issues in LLM-generated code
Developing autonomous dual-LLM system for test verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-LLM system for test generation and verification
Automated compiler test generation using multiple LLMs
Ten metrics to evaluate LLM-generated compiler tests
🔎 Similar Papers
No similar papers found.
Z
Zachariah Sollenberger
Computational Research and Programming Lab, University of Delaware, Newark, DE, USA
Rahul Patel
Rahul Patel
PhD Student, University of Toronto
Machine LearningCombinatorial Optimization
S
Saieda Ali Zada
Computational Research and Programming Lab, University of Delaware, Newark, DE, USA
Sunita Chandrasekaran
Sunita Chandrasekaran
Associate Professor, Dept. of CIS, University of Delaware
High Performance ComputingParallel ProgrammingOpenMPOpenACCSupercomputing