🤖 AI Summary
This study presents the first systematic evaluation of whether large language models (LLMs) can automatically generate reproducible, verifiable agent-based model code from ODD protocol specifications. Using the PPHPC predator–prey model as a benchmark, the authors assess Python implementations produced by 17 LLMs across four dimensions: executability, behavioral consistency, computational efficiency, and code maintainability. Results indicate that GPT-4.1 consistently generates statistically valid and efficient code, with Claude 3.7 Sonnet performing comparably but less stably. Crucially, the study demonstrates that executability does not guarantee behavioral fidelity, underscoring the essential role of formal verification in scientific modeling. These findings reveal the potential of LLMs for specification-driven modeling while confirming that current models cannot yet replace human expertise in rigorous scientific model development.
📝 Abstract
Large language models (LLMs) can now synthesize non-trivial executable code from textual descriptions, raising an important question: can LLMs reliably implement agent-based models from standardized specifications in a way that supports replication, verification, and validation? We address this question by evaluating 17 contemporary LLMs on a controlled ODD-to-code translation task, using the PPHPC predator–prey model as a fully specified reference. Generated Python implementations are assessed through staged executability checks, model-independent statistical comparison against a validated NetLogo baseline, and quantitative measures of runtime efficiency and maintainability. Results show that behaviorally faithful implementations are achievable but not guaranteed, and that executability alone is insufficient for scientific use. GPT-4.1 consistently produces statistically valid and efficient implementations, with Claude 3.7 Sonnet performing well but less reliably. Overall, the findings clarify both the promise and current limitations of LLMs as model engineering tools, with implications for reproducible agent-based and environmental modelling.
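To make the "model-independent statistical comparison" concrete: verification of this kind typically collects a summary statistic (e.g., steady-state mean prey count) over many independent replications of each implementation and then applies a distributional test. The sketch below is an illustrative assumption, not the paper's actual pipeline — it uses a two-sample Kolmogorov–Smirnov statistic on synthetic replicate data standing in for NetLogo-baseline and LLM-generated model outputs.

```python
# Hypothetical sketch of model-independent output comparison between two
# ABM implementations. All names, sample data, and thresholds here are
# illustrative assumptions, not the study's actual values.
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for v in sorted(set(a) | set(b)):
        fa = sum(x <= v for x in a) / len(a)
        fb = sum(x <= v for x in b) / len(b)
        d = max(d, abs(fa - fb))
    return d


random.seed(42)
# Stand-ins for a focal measure (e.g., mean prey population at steady state)
# collected over 30 replications of each implementation.
baseline = [random.gauss(1600, 40) for _ in range(30)]   # validated reference
candidate = [random.gauss(1605, 40) for _ in range(30)]  # LLM-generated model

d = ks_statistic(baseline, candidate)
# Asymptotic critical value at alpha = 0.05 for n = m = 30:
# D_crit = 1.358 * sqrt((n + m) / (n * m)) ~= 0.351
print(f"KS statistic D = {d:.3f}; flag divergence if D > 0.351")
```

The key point mirrored from the abstract: an implementation can pass every executability check and still fail a test like this, which is why behavioral comparison against a validated baseline is a separate verification stage.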