An Agentic Evaluation Framework for AI-Generated Scientific Code in PETSc

πŸ“… 2026-03-16
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing evaluation methods struggle to comprehensively assess critical dimensions of AI-generated high-performance computing (HPC) scientific code beyond functional correctness, such as performance, memory management, algorithmic suitability, and library-specific conventions. This work proposes petscagent-bench, a framework that introduces a black-box evaluation mechanism grounded in standardized agent communication protocols (A2A and MCP). Leveraging a tool-augmented multi-agent architecture, the framework enables automated assessment of PETSc code generated by any coding model across five key dimensions: correctness, performance, code quality, algorithmic appropriateness, and adherence to library conventions. Empirical analysis reveals that while state-of-the-art large language models produce structurally sound and readable code, they exhibit systematic deficiencies in conforming to PETSc-specific norms, issues that conventional testing approaches largely fail to detect.

πŸ“ Abstract
While large language models have significantly accelerated scientific code generation, comprehensively evaluating the generated code remains a major challenge. Traditional benchmarks reduce evaluation to test-case matching, an approach insufficient for library code in HPC where solver selection, API conventions, memory management, and performance are just as critical as functional correctness. To address this gap, we introduce petscagent-bench, an agentic framework built on an agents-evaluating-agents paradigm. Instead of relying on static scripts, petscagent-bench deploys a tool-augmented evaluator agent that compiles, executes, and measures code produced by a separate model-under-test agent, orchestrating a 14-evaluator pipeline across five scoring categories: correctness, performance, code quality, algorithmic appropriateness, and library-specific conventions. Because the agents communicate through standardized protocols (A2A and MCP), the framework enables black-box evaluation of any coding agent without requiring access to its source code. We demonstrate the framework on a benchmark suite of realistic problems using the PETSc library for HPC. Our empirical analysis of frontier models reveals that while current models generate readable, well-structured code, they consistently struggle with library-specific conventions that traditional pass/fail metrics completely miss.
Problem

Research questions and friction points this paper is trying to address.

AI-generated code
evaluation framework
scientific computing
HPC
PETSc
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic evaluation
scientific code generation
PETSc
HPC
black-box assessment
πŸ”Ž Similar Papers
No similar papers found.