MSQA: Benchmarking LLMs on Graduate-Level Materials Science Reasoning and Knowledge

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
The materials science community lacks domain-specific evaluation benchmarks, hindering rigorous assessment of large language models (LLMs) on technical reasoning and factual accuracy. Method: We introduce MSQA, the first graduate-level benchmark for materials science, comprising 1,757 questions spanning seven subdomains (including structure-property relationships and synthesis protocols) and featuring a dual-mode question format that pairs detailed explanatory responses with binary True/False judgments, probing both precise factual knowledge and multi-step reasoning. We conduct a standardized evaluation across ten state-of-the-art LLMs (closed-source APIs, open-weight models, and domain-finetuned variants), complemented by accuracy analysis, error attribution, and generalization diagnostics. Results: A substantial performance gap emerges: the best closed-source models reach 84.5% accuracy, open-weight models peak around 60.5%, and most domain-finetuned models underperform due to overfitting and distributional shift. These results reveal critical limitations in current LLM capabilities for advanced materials science reasoning.
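
To make the dual-mode setup concrete, here is a minimal sketch of an evaluation loop for MSQA-style items. The item schema, the `ask_model` callable, and the True/False parsing rule are all assumptions for illustration; the paper does not specify its harness here.

```python
# Minimal sketch of a dual-mode evaluation loop for MSQA-style items.
# MSQAItem, ask_model(), and the True/False parsing rule are hypothetical;
# they illustrate the benchmark's two modes, not the authors' actual code.
import re
from dataclasses import dataclass

@dataclass
class MSQAItem:
    question: str   # graduate-level materials science question
    subdomain: str  # e.g. "structure-property relationships"
    statement: str  # paired statement for the binary-judgment mode
    label: bool     # ground-truth True/False for the statement

def parse_true_false(response: str) -> bool | None:
    """Extract a True/False verdict from a free-form model response."""
    match = re.search(r"\b(true|false)\b", response, re.IGNORECASE)
    return match.group(1).lower() == "true" if match else None

def evaluate(items: list[MSQAItem], ask_model) -> float:
    """Accuracy on the binary-judgment mode. The detailed explanatory
    mode would require rubric- or LLM-based grading instead."""
    correct = 0
    for item in items:
        prompt = (
            f"Question: {item.question}\n"
            f"Statement: {item.statement}\n"
            "Answer True or False, then justify briefly."
        )
        verdict = parse_true_false(ask_model(prompt))
        correct += int(verdict == item.label)
    return correct / len(items)
```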

📝 Abstract
Despite recent advances in large language models (LLMs) for materials science, there is a lack of benchmarks for evaluating their domain-specific knowledge and complex reasoning abilities. To bridge this gap, we introduce MSQA, a comprehensive evaluation benchmark of 1,757 graduate-level materials science questions in two formats: detailed explanatory responses and binary True/False assessments. MSQA distinctively challenges LLMs by requiring both precise factual knowledge and multi-step reasoning across seven materials science sub-fields, such as structure-property relationships, synthesis processes, and computational modeling. Through experiments with 10 state-of-the-art LLMs, we identify significant gaps in current LLM performance. While API-based proprietary LLMs achieve up to 84.5% accuracy, open-source (OSS) LLMs peak around 60.5%, and domain-specific LLMs often underperform significantly due to overfitting and distributional shifts. MSQA represents the first benchmark to jointly evaluate the factual knowledge and reasoning capabilities crucial for LLMs in advanced materials science.
Problem

Research questions and friction points this paper is trying to address.

Lack of benchmarks for LLMs in materials science knowledge and reasoning
Need for evaluating both factual accuracy and multi-step reasoning in materials science
Performance gaps between proprietary, open-source, and domain-specific LLMs in materials science
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MSQA for graduate-level materials science evaluation
Combines factual knowledge and multi-step reasoning challenges
Benchmarks 10 LLMs across proprietary, open-weight, and domain-finetuned groups, revealing performance gaps (see the sketch below)
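
The headline result is a group-level accuracy comparison. The sketch below shows one way to aggregate per-model scores into those groups; the per-model numbers are hypothetical placeholders, and only the 84.5% (best closed-source) and ~60.5% (open-weight peak) figures come from the text.

```python
# Group-level accuracy aggregation, mirroring the paper's headline gap.
# Per-model accuracies are hypothetical placeholders except where noted.
from collections import defaultdict

results = {
    "closed-source-a": ("closed-source", 0.845),      # reported best
    "closed-source-b": ("closed-source", 0.81),       # placeholder
    "open-weight-a":   ("open-weight", 0.605),        # reported peak
    "open-weight-b":   ("open-weight", 0.55),         # placeholder
    "domain-ft-a":     ("domain-finetuned", 0.48),    # placeholder
}

by_group = defaultdict(list)
for _, (group, acc) in results.items():
    by_group[group].append(acc)

for group, accs in sorted(by_group.items()):
    print(f"{group:>18}: best={max(accs):.1%}  mean={sum(accs)/len(accs):.1%}")
```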