🤖 AI Summary
HPC unit testing faces challenges including parallel non-determinism, difficulty in detecting synchronization bugs, and hardware heterogeneity, leaving conventional approaches with insufficient coverage. This paper proposes the first automated test generation framework for HPC based on multi-agent large language models (LLMs). It introduces a collaborative architecture comprising a Recipe Agent and a Test Agent, integrated with a critique-feedback loop and dual verification (both compilation and functional correctness) to precisely model OpenMP/MPI parallel structures, communication patterns, and hierarchical concurrency. Experimental evaluation shows that the framework significantly improves test compilability (+32.7%) and functional correctness (+28.4%), uncovering fine-grained synchronization defects and data races that traditional tools miss. The approach thus enhances the reliability and maintainability of HPC software.
📝 Abstract
Unit testing in High-Performance Computing (HPC) is critical but challenged by parallelism, complex algorithms, and diverse hardware. Traditional methods often fail to address non-deterministic behavior and synchronization issues in HPC applications. This paper introduces HPCAgentTester, a novel multi-agent Large Language Model (LLM) framework designed to automate and enhance unit test generation for HPC software utilizing OpenMP and MPI. HPCAgentTester employs a unique collaborative workflow where specialized LLM agents (Recipe Agent and Test Agent) iteratively generate and refine test cases through a critique loop. This architecture enables the generation of context-aware unit tests that specifically target parallel execution constructs, complex communication patterns, and hierarchical parallelism. We demonstrate HPCAgentTester's ability to produce compilable and functionally correct tests for OpenMP and MPI primitives, effectively identifying subtle bugs that are often missed by conventional techniques. Our evaluation shows that HPCAgentTester significantly improves test compilation rates and correctness compared to standalone LLMs, offering a more robust and scalable solution for ensuring the reliability of parallel software systems.