🤖 AI Summary
In scientific computing, automatically generating high-reliability code with large language models (LLMs) remains hindered by domain-specific data scarcity and the practical infeasibility of RLHF within small expert communities. To address this, we propose the first multi-agent code generation framework centered on *unit-physics consistency*: dimensional analysis and conservation laws are encoded as verifiable unit tests, enabling a primitives-centric collaborative system that suppresses syntactic hallucinations, numerical inaccuracies, and configuration fragility. Our method integrates open-source LLMs, chain-of-thought decoding, and multi-agent coordination to achieve end-to-end scientific code synthesis. Evaluated on a combustion simulation task, the framework converges within 5–6 iterative refinement rounds. The generated code matches a human-written implementation in accuracy (mean error: 3.1×10⁻³ %), runs 33.4 % faster, uses 30 % less memory, and remains cost-effective.
📄 Abstract
Agentic large language models are proposed as autonomous code generators for scientific computing, yet their reliability on high-stakes problems remains unclear. Developing computational scientific software from natural-language queries remains broadly challenging due to (a) the sparse representation of domain codes in training data and (b) the limited feasibility of RLHF with a small expert community. To address these limitations, this work conceptualizes an inverse approach to code design, embodied in the Chain of Unit-Physics framework: a first-principles (or primitives)-centric, multi-agent system in which human expert knowledge is encoded as unit-physics tests that explicitly constrain code generation. The framework is evaluated on a nontrivial combustion task, used here as a representative benchmark scientific problem with realistic physical constraints. Closed-weight systems and code-focused agentic variants fail to produce correct end-to-end solvers despite tool and web access, exhibiting four recurrent error classes: interface (syntax/API) hallucinations, overconfident assumptions, numerical/physical incoherence, and configuration fragility. Open-weight models with chain-of-thought (CoT) decoding reduce interface errors but still yield incorrect solutions. On the benchmark task, the proposed framework converges within 5–6 iterations and matches the human-expert implementation (mean error of $3.1\times10^{-3}$ %), with a $\sim$33.4 % faster runtime and $\sim$30 % lower memory usage, at a cost comparable to mid-sized commercial APIs, yielding a practical template for physics-grounded scientific code generation. As datasets and models evolve, zero-shot code accuracy will improve; the Chain of Unit-Physics framework, however, goes further by embedding the first-principles analysis that is foundational to scientific codes.
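To make the idea of unit-physics tests concrete, the following is a minimal sketch of how a conservation law might be encoded as a verifiable unit test that constrains generated solver code. The toy `advance_mass_fractions` step and all names here are illustrative assumptions, not the paper's actual implementation:

```python
# Hedged sketch: physical invariants expressed as unit tests, in the spirit
# of the Chain of Unit-Physics framework. The "solver step" below is a toy
# stand-in for LLM-generated combustion code under test.

def advance_mass_fractions(Y, dt):
    """Toy solver step: relax species mass fractions toward a fixed
    equilibrium composition, renormalizing so total mass is conserved."""
    Y_eq = [0.2, 0.5, 0.3]  # assumed equilibrium mass fractions
    Y_new = [y + dt * (ye - y) for y, ye in zip(Y, Y_eq)]
    total = sum(Y_new)
    return [y / total for y in Y_new]

def test_mass_conservation():
    # Conservation law as a unit test: mass fractions must stay normalized
    # (sum to 1) after many integration steps.
    Y = [0.1, 0.6, 0.3]
    for _ in range(100):
        Y = advance_mass_fractions(Y, dt=0.01)
    assert abs(sum(Y) - 1.0) < 1e-9

def test_positivity():
    # Physical realizability constraint: no negative mass fractions.
    Y = [0.1, 0.6, 0.3]
    for _ in range(100):
        Y = advance_mass_fractions(Y, dt=0.01)
    assert all(y >= 0.0 for y in Y)

test_mass_conservation()
test_positivity()
print("unit-physics tests passed")
```

In the framework described above, tests of this kind (conservation, positivity, dimensional consistency) would serve as the machine-checkable contract that the multi-agent system iterates against until the generated solver satisfies them.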