CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating large language models (LLMs) on graduate-level condensed matter physics computation remains challenging due to the lack of domain-specific, process-aware benchmarks. Method: We introduce CMPhysBench—the first dedicated benchmark for this task—comprising more than 520 expert-curated problems spanning magnetism, superconductivity, and strongly correlated systems. To enable fine-grained, non-binary assessment of solution derivations, we propose a tree-based expression representation and a Scalable Expression Edit Distance (SEED) metric. SEED leverages syntactic tree matching and human-validated problem formulations to improve similarity estimation accuracy. Contribution/Results: Experiments reveal severe limitations of state-of-the-art LLMs: Grok-4 achieves only a 36/100 SEED score and 28% answer accuracy on CMPhysBench, underscoring fundamental deficits in physical reasoning, symbolic manipulation, and domain-specific conceptual understanding. CMPhysBench thus provides both a rigorous evaluation framework and a diagnostic tool for advancing LLM capabilities in theoretical condensed matter physics.

📝 Abstract
We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in Condensed Matter Physics. CMPhysBench is composed of more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of the similarity between a prediction and the ground truth. Our results show that even the best model, Grok-4, reaches only a 36 average SEED score and 28% accuracy on CMPhysBench, underscoring a significant capability gap, especially in this practical and frontier domain, relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' proficiency in condensed matter physics
Assessing model performance on graduate-level calculation problems
Measuring solution accuracy with fine-grained similarity metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graduate-level curated questions for LLM evaluation
Tree-based expression representation for similarity assessment
Scalable Expression Edit Distance for partial credit
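To illustrate the idea behind partial credit from tree edit distance, here is a minimal sketch. It is not the authors' SEED implementation: the `Node`, `dist`, and `seed_like_score` names are hypothetical, the edit distance is a naive positional recursion rather than a proper tree-matching algorithm, and the normalization to a 0-100 score is an assumption about how such a metric could be scaled.

```python
# Illustrative sketch only, NOT the paper's SEED metric: expressions as
# trees, a naive tree edit distance, and a 0-100 partial-credit score.

class Node:
    """A node in an expression tree: an operator or symbol with children."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def size(t):
    """Number of nodes in the tree."""
    return 1 + sum(size(c) for c in t.children)

def dist(a, b):
    """Naive edit distance: 0/1 root-label cost, children aligned by
    position. (A real metric would use optimal tree matching.)"""
    if a is None:
        return size(b) if b is not None else 0
    if b is None:
        return size(a)
    cost = 0 if a.label == b.label else 1
    for i in range(max(len(a.children), len(b.children))):
        ca = a.children[i] if i < len(a.children) else None
        cb = b.children[i] if i < len(b.children) else None
        cost += dist(ca, cb)
    return cost

def seed_like_score(pred, gold):
    """Map edit distance into a 0-100 score: identical trees score 100,
    and each mismatched node costs a fraction of the gold tree's size."""
    return 100.0 * max(0.0, 1.0 - dist(pred, gold) / size(gold))

# Example: gold answer E = p^2 / (2m), prediction E = p^2 / m.
gold = Node('/', [Node('^', [Node('p'), Node('2')]),
                  Node('*', [Node('2'), Node('m')])])
pred = Node('/', [Node('^', [Node('p'), Node('2')]),
                  Node('m')])
print(seed_like_score(pred, gold))  # partial credit for the missing factor of 2
```

The point of the example is the non-binary grading: a prediction that gets the structure of the answer mostly right (here, dropping only a factor of 2 in the denominator) earns substantial partial credit instead of a flat zero.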
Weida Wang
Shanghai AI Lab
Dongchen Huang
Beijing National Laboratory for Condensed Matter Physics and Institute of Physics, Chinese Academy of Sciences
Jiatong Li
PhD candidate, Hong Kong Polytechnic University
Natural Language Processing, Bioinformatics, Molecule Discovery
Tengchao Yang
Tongji University
Ziyang Zheng
Shanghai Jiao Tong University
Signal Processing, Inverse Problem, Photonic Computing
Di Zhang
Shanghai AI Lab
Dong Han
Shanghai AI Lab
Benteng Chen
Shanghai AI Lab
Binzhao Luo
Condensed Matter Physics Data Center, Chinese Academy of Sciences
Zhiyu Liu
Condensed Matter Physics Data Center, Chinese Academy of Sciences
Kunling Liu
Condensed Matter Physics Data Center, Chinese Academy of Sciences
Zhiyuan Gao
Condensed Matter Physics Data Center, Chinese Academy of Sciences
Shiqi Geng
Shanghai AI Lab
Wei Ma
Tongji University
Jiaming Su
Tongji University
Xin Li
Tongji University
Shuchen Pu
Shanghai AI Lab
Yuhan Shui
Shanghai AI Lab
Qianjia Cheng
Shanghai AI Lab
Zhihao Dou
Shanghai AI Lab
Dongfei Cui
Shanghai AI Lab
Changyong He
Tongji University
Jin Zeng
Tongji University
Zeke Xie
Assistant Professor, The Hong Kong University of Science and Technology (Guangzhou) / PI, xLeaF Lab
Generative AI, Data-centric AI, Large Models, Deep Learning Theory, Optimization
Mao Su
Shanghai AI Laboratory
Physics, AI