🤖 AI Summary
The materials science community lacks domain-specific evaluation benchmarks, hindering rigorous assessment of large language models (LLMs) on technical reasoning and factual accuracy.
Method: We introduce MSQA, the first graduate-level benchmark for materials science, comprising 1,757 questions spanning seven subdomains, including structure–property relationships and synthesis protocols. Each question uses a dual-mode format that pairs a detailed explanatory response with a binary True/False judgment, probing both domain-specific knowledge and robustness to distributional shift. We conduct standardized evaluation across ten state-of-the-art LLMs (closed-source APIs, open-weight models, and domain-finetuned variants), complemented by accuracy analysis, error attribution, and generalization diagnostics.
Results: A substantial performance gap emerges: top closed-source models achieve up to 84.5% accuracy, while open-weight models peak around 60.5%; notably, most domain-finetuned models underperform due to overfitting. This reveals critical limitations in current LLM capabilities for advanced materials science reasoning.
📝 Abstract
Despite recent advances in large language models (LLMs) for materials science, there is a lack of benchmarks for evaluating their domain-specific knowledge and complex reasoning abilities. To bridge this gap, we introduce MSQA, a comprehensive evaluation benchmark of 1,757 graduate-level materials science questions in two formats: detailed explanatory responses and binary True/False assessments. MSQA distinctively challenges LLMs by requiring both precise factual knowledge and multi-step reasoning across seven materials science sub-fields, such as structure–property relationships, synthesis processes, and computational modeling. Through experiments with 10 state-of-the-art LLMs, we identify significant gaps in current LLM performance. While API-based proprietary LLMs achieve up to 84.5% accuracy, open-source (OSS) LLMs peak around 60.5%, and domain-specific LLMs often underperform significantly due to overfitting and distributional shifts. MSQA is the first benchmark to jointly evaluate the factual and reasoning capabilities crucial for applying LLMs to advanced materials science.