AI Summary
Large language models (LLMs) exhibit insufficient knowledge reliability for materials science applications, particularly within the processing-structure-property-performance (PSPP) paradigm.
Method: We systematically evaluate LLMs' reasoning capabilities across PSPP stages and introduce MatKB, the first lightweight, domain-specific knowledge benchmark for materials science, covering fundamental factual tasks including periodic table knowledge, phase diagram fundamentals, and thermodynamic relationships. We analyze tokenizer and vocabulary impacts on material entity disambiguation and employ structured prompting with multi-source factual verification to quantify the performance of leading open-weight models (Llama-3, Qwen2, Phi-3).
Contribution/Results: Results reveal substantial factual inaccuracies: accuracy in structure-property mapping falls below 40%, and generic LLMs consistently underperform domain-specific tools. Tokenizer design critically affects material entity representation fidelity. Our findings underscore the necessity of knowledge augmentation or domain adaptation, providing an empirical benchmark and methodological framework to guide LLM selection and specialization for engineering applications.
Abstract
Large Language Models (LLMs) are increasingly applied in the fields of mechanical engineering and materials science. As models that establish connections through the interface of language, LLMs can be applied for step-wise reasoning through the Processing-Structure-Property-Performance chain of materials science and engineering. Current LLMs are trained to adequately represent a dataset that constitutes most of the accessible internet. However, the internet mostly contains non-scientific content. If LLMs are to be applied for engineering purposes, it is valuable to investigate models for their intrinsic knowledge -- here: the capacity to generate correct information about materials. In the current work, using the example of the Periodic Table of Elements, we highlight the role of vocabulary and tokenization for the uniqueness of material fingerprints, and the capabilities of different state-of-the-art open models to generate factually correct output. This leads to a materials knowledge benchmark that enables an informed choice of the steps in the PSPP chain for which LLMs are applicable, and those where specialized models are required.
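To make the vocabulary point concrete, the following is a minimal sketch of how a tokenizer's vocabulary can change a material's token "fingerprint". The two vocabularies and the greedy longest-match scheme are invented for illustration; real LLM tokenizers (e.g. BPE-based ones) are learned from corpus statistics and are more complex, but the effect is the same: a formula like SnO2 may or may not split along element boundaries.

```python
def tokenize(text, vocab, max_piece_len=4):
    """Greedy longest-match tokenization; unknown single characters
    become their own tokens. Purely illustrative, not a real LLM tokenizer."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(len(text) - i, max_piece_len), 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens

# Hypothetical vocabularies (assumptions, not taken from any real model):
VOCAB_A = {"Sn", "O2", "Co", "Si"}          # pieces aligned with element symbols
VOCAB_B = {"S", "nO", "2", "C", "oS", "i"}  # sub-word pieces ignoring chemistry

print(tokenize("SnO2", VOCAB_A))  # ['Sn', 'O2']  -- element-aware fingerprint
print(tokenize("SnO2", VOCAB_B))  # ['S', 'nO', '2']  -- chemistry is lost
```

Under vocabulary A the model sees tin and oxygen as distinct units; under vocabulary B the same formula dissolves into pieces that carry no chemical meaning, which is one way tokenization can undermine the uniqueness of material fingerprints.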