CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

📅 2025-08-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Evaluating large language models (LLMs) on graduate-level condensed matter physics calculations remains difficult because domain-specific, process-aware benchmarks are lacking. Method: The paper introduces CMPhysBench, described as the first dedicated benchmark for this task, comprising more than 520 expert-curated problems spanning magnetism, superconductivity, and strongly correlated systems. To allow fine-grained, non-binary assessment of derived solutions, the authors propose a tree-based expression representation and the Scalable Expression Edit Distance (SEED) metric. SEED relies on syntactic tree matching and human-validated problem formulations to estimate the similarity between a predicted and a reference expression more accurately. Contribution/Results: Experiments show that state-of-the-art LLMs perform poorly: the best model, Grok-4, reaches an average SEED score of only 36 out of 100 and 28% answer accuracy on CMPhysBench, pointing to deficits in physical reasoning, symbolic manipulation, and domain-specific conceptual understanding. CMPhysBench therefore serves both as a rigorous evaluation framework and as a diagnostic tool for improving LLM capabilities in theoretical condensed matter physics.

📝 Abstract

We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in condensed matter physics. CMPhysBench consists of more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Leveraging tree-based representations of expressions, we also introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of the similarity between prediction and ground truth. Our results show that even the best model, Grok-4, reaches an average SEED score of only 36 and 28% accuracy on CMPhysBench, underscoring a significant capability gap, especially for this practical and frontier domain relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.
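To make the idea of partial credit from tree-based expression matching concrete, here is a minimal sketch in Python. It is not the authors' SEED implementation: the helper names (`tree`, `forest_distance`, `partial_credit`), the unit edit costs, and the linear normalization to a 0-100 score are all assumptions chosen for illustration. It computes a plain ordered tree edit distance between two expression trees and turns it into a score.

```python
from functools import lru_cache


def tree(label, *children):
    """Build a hashable expression-tree node: (label, tuple_of_children)."""
    return (label, tuple(children))


def size(forest):
    """Count the nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Ordered edit distance between two forests with unit costs (assumed)."""
    if not f:
        return size(g)  # insert every remaining node of g
    if not g:
        return size(f)  # delete every remaining node of f
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    f_rest, g_rest = f[:-1], g[:-1]
    # Delete the rightmost root of f; its children take its place.
    delete = forest_distance(f_rest + v_children, g) + 1
    # Insert the rightmost root of g; its children take its place.
    insert = forest_distance(f, g_rest + w_children) + 1
    # Map the two rightmost roots onto each other, relabeling if needed.
    match = (forest_distance(v_children, w_children)
             + forest_distance(f_rest, g_rest)
             + (0 if v_label == w_label else 1))
    return min(delete, insert, match)


def partial_credit(pred, truth):
    """Hypothetical 0-100 score: fraction of the reference tree left unedited."""
    n = size((truth,))
    dist = forest_distance((pred,), (truth,))
    return 100.0 * max(0, n - dist) / n


# Reference answer a*b + c versus a prediction a*b + d (one wrong symbol).
truth = tree("Add", tree("Mul", tree("a"), tree("b")), tree("c"))
pred = tree("Add", tree("Mul", tree("a"), tree("b")), tree("d"))
print(forest_distance((pred,), (truth,)), partial_credit(pred, truth))
```

A binary exact-match metric would give this prediction zero credit, whereas an edit-distance score rewards it for getting most of the expression's structure right. The actual SEED score may use different costs, simplification steps, and normalization.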

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' proficiency in condensed matter physics

Assessing model performance on graduate-level calculation problems

Measuring solution accuracy with fine-grained similarity metrics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Graduate-level curated questions for LLM evaluation

Tree-based expression representation for similarity assessment

Scalable Expression Edit Distance for partial credit

🔎 Similar Papers

MatText: Do Language Models Need More than Text & Scale for Materials Modeling?