🤖 AI Summary
This paper addresses the challenge of evaluating large language models' (LLMs) capabilities in applying computational materials science tools. It introduces the first domain-specific benchmark tailored to physical simulation and analysis software (e.g., pymatgen). Methodologically, it proposes a tool-aware evaluation framework that automatically constructs dual-task datasets, comprising question answering and code generation, from authentic documentation and source code, covering 49 tasks and 138 subtasks. A secure sandbox execution environment and a multi-dimensional functional correctness assessment protocol ensure rigorous, reproducible evaluation. Key contributions include: (1) the first empirical finding that general-purpose LLMs outperform domain-finetuned models at materials science tool invocation; (2) the identification of code conciseness as a factor that significantly improves execution success rates; and (3) a reproducible, extensible evaluation paradigm for AI for Science.
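The secure sandbox mentioned above can be illustrated with a minimal sketch using only Python's standard library: generated code runs in a separate interpreter process with a wall-clock timeout, and functional correctness is reduced here to "exited successfully". The function name, timeout, and success criterion are illustrative assumptions, not the paper's actual protocol.

```python
import os
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout: float = 10.0) -> dict:
    """Execute untrusted generated code in a child interpreter process,
    capturing stdout/stderr and enforcing a timeout.
    (A production sandbox would also restrict filesystem/network access.)"""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return {"ok": proc.returncode == 0,
                "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": "timeout"}
    finally:
        os.unlink(path)
```

For example, `run_sandboxed("print(2 + 2)")` reports success with `"4"` on stdout, while a script that raises an exception or hangs past the timeout is reported as a failure rather than crashing the evaluator.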
📝 Abstract
Large language models (LLMs) are increasingly applied to materials science questions, including literature comprehension, property prediction, materials discovery, and alloy design. At the same time, a wide range of physics-based computational approaches have been developed with which materials properties can be calculated. Here, we propose a benchmark application to evaluate the proficiency of LLMs in answering materials science questions through the generation and safe execution of code based on such physics-based computational materials science packages. MatTools is built on two complementary components: a materials simulation tool question-answer (QA) benchmark and a real-world tool-usage benchmark. We designed an automated methodology to efficiently collect real-world materials science tool-use examples. The QA benchmark, derived from the pymatgen (Python Materials Genomics) codebase and documentation, comprises 69,225 QA pairs that assess the ability of an LLM to understand materials science tools. The real-world benchmark contains 49 tasks (138 subtasks) requiring the generation of functional Python code for materials property calculations. Our evaluation of diverse LLMs yields three key insights: (1) generalists outshine specialists; (2) AI knows AI; and (3) simpler is better. MatTools provides a standardized framework for assessing and improving LLM capabilities for materials science tool applications, facilitating the development of more effective AI systems for materials science and general scientific research.
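The abstract does not spell out how QA pairs are harvested from the pymatgen codebase; as a rough, hedged sketch of that kind of automated construction, one could walk a module's AST and turn each documented function into a question-answer pair. The question template, the `harvest_qa_pairs` name, and the sample function below are illustrative assumptions, not the authors' pipeline.

```python
import ast

def harvest_qa_pairs(source: str) -> list[dict]:
    """Parse Python source and build one QA pair per documented function,
    using the function name in the question and the first docstring
    line as the reference answer."""
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                pairs.append({
                    "question": f"What does the function `{node.name}` do?",
                    "answer": doc.splitlines()[0],
                })
    return pairs

# Hypothetical snippet standing in for library source code:
sample = '''
def get_density(structure):
    """Return the mass density of a crystal structure in g/cm^3."""
'''
```

Applied to a real codebase, a pipeline like this scales naturally with the library, which is one plausible way a corpus on the order of tens of thousands of QA pairs could be assembled from documentation and source alone.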