🤖 AI Summary
This work addresses the lack of effective evaluation methods for assessing the scientific validity of domain-specific language (DSL) code—such as LAMMPS molecular dynamics input scripts—generated by large language models (LLMs). To tackle this challenge, the authors propose a lightweight validation framework that combines input file normalization, an extensible DSL parser, and static syntactic and semantic checks. This approach enables domain experts to efficiently verify LLM-generated outputs without requiring deep expertise in the target DSL. By circumventing costly runtime execution, the framework facilitates systematic benchmarking of mainstream LLMs on scientific DSL generation tasks, revealing their current limitations. The study thus provides a practical pathway toward the safe integration of LLMs into specialized scientific computing workflows.
📝 Abstract
Large language models (LLMs) are changing the way researchers interact with code and data in scientific computing. While their ability to generate general-purpose code is well established, their effectiveness in producing scientifically valid code and input scripts for domain-specific languages (DSLs) remains largely unexplored. We propose an evaluation procedure that enables domain experts (who may not be experts in the DSL) to assess the validity of LLM-generated input files for LAMMPS, a widely used molecular dynamics (MD) code, and to use those assessments to evaluate the performance of state-of-the-art LLMs and identify common issues. Key to the evaluation procedure are a normalization step that produces canonical files and an extensible parser for syntax analysis. Subsequent steps isolate common errors without incurring tests that are costly in time and computational resources. Once a working input file is generated, LLMs can accelerate verification tests. Our findings highlight the limitations of LLMs in generating scientific DSLs and point to a practical path forward for their integration into domain-specific computational ecosystems by domain experts.
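The normalize-then-statically-check pipeline described above can be illustrated with a minimal sketch. This is not the authors' framework; it is a hypothetical Python example assuming LAMMPS conventions (`#` comments, `&` line continuations) and an illustrative whitelist of known commands standing in for the extensible parser:

```python
import re

# Illustrative subset of LAMMPS commands for a static check
# (a real parser would cover the full command grammar).
KNOWN_COMMANDS = {
    "units", "dimension", "boundary", "atom_style", "lattice",
    "region", "create_box", "create_atoms", "mass", "pair_style",
    "pair_coeff", "velocity", "fix", "timestep", "thermo", "run",
}

def normalize(script: str) -> list[str]:
    """Produce canonical command lines: join '&' continuations,
    strip '#' comments, collapse whitespace, drop blank lines."""
    joined = re.sub(r"&\s*\n", " ", script)  # join continuation lines
    lines = []
    for raw in joined.splitlines():
        line = raw.split("#", 1)[0]   # remove trailing comments
        line = " ".join(line.split()) # collapse runs of whitespace
        if line:
            lines.append(line)
    return lines

def check(lines: list[str]) -> list[str]:
    """Flag commands outside the known subset (a cheap static check,
    avoiding any runtime execution of the MD code)."""
    errors = []
    for i, line in enumerate(lines, 1):
        cmd = line.split()[0]
        if cmd not in KNOWN_COMMANDS:
            errors.append(f"line {i}: unknown command '{cmd}'")
    return errors

script = """
units lj            # reduced Lennard-Jones units
atom_style atomic
lattice fcc 0.8442
run &
    1000
"""
print(check(normalize(script)))  # prints [] for this valid subset
```

A real validation layer would add semantic checks on top (e.g. that `units` appears before commands whose arguments depend on it), but even this toy version shows how canonicalization makes such checks tractable for a domain expert who is not fluent in the DSL.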