🤖 AI Summary
Evaluating large language models (LLMs) on graduate-level condensed matter physics computation remains challenging due to the lack of domain-specific, process-aware benchmarks. Method: We introduce CMPhysBench, the first dedicated benchmark for this task, comprising more than 520 expert-curated problems spanning magnetism, superconductivity, and strongly correlated systems. To enable fine-grained, non-binary assessment of solution derivations, we propose a tree-based expression representation and the Scalable Expression Edit Distance (SEED) metric. SEED leverages syntactic tree matching and human-validated problem formulations to improve similarity estimation accuracy. Contribution/Results: Experiments reveal severe limitations of state-of-the-art LLMs: Grok-4 achieves an average SEED score of only 36/100 and 28% answer accuracy on CMPhysBench, underscoring fundamental deficits in physical reasoning, symbolic manipulation, and domain-specific conceptual understanding. CMPhysBench thus provides both a rigorous evaluation framework and a diagnostic tool for advancing LLM capabilities in theoretical condensed matter physics.
📝 Abstract
We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in condensed matter physics. CMPhysBench comprises more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Leveraging tree-based representations of expressions, we also introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of the similarity between prediction and ground truth. Our results show that even the best model, Grok-4, reaches only an average SEED score of 36 and 28% accuracy on CMPhysBench, underscoring a significant capability gap, especially for this practical and frontier domain relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.
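
To make the idea of tree-based partial credit concrete, the sketch below scores a predicted expression against a reference by comparing their expression trees. It is an illustration only and not the SEED implementation from the paper: it assumes expressions are written in Python syntax and parsed with the standard `ast` module, it uses a simple top-down (Selkow-style) tree edit distance with unit costs, and the 0 to 100 normalization by tree size is an assumption. The names `to_tree`, `tree_distance`, and `similarity_score` are hypothetical.

```python
import ast
from functools import lru_cache


def to_tree(expr: str):
    """Parse an expression string into a nested (label, children) tuple."""
    def convert(node):
        if isinstance(node, ast.BinOp):
            return (type(node.op).__name__,
                    (convert(node.left), convert(node.right)))
        if isinstance(node, ast.UnaryOp):
            return (type(node.op).__name__, (convert(node.operand),))
        if isinstance(node, ast.Call):
            # Function name becomes the node label, arguments its children.
            return (getattr(node.func, "id", "call"),
                    tuple(convert(arg) for arg in node.args))
        if isinstance(node, ast.Name):
            return (node.id, ())
        if isinstance(node, ast.Constant):
            return (repr(node.value), ())
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    return convert(ast.parse(expr, mode="eval").body)


@lru_cache(maxsize=None)
def size(tree) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def tree_distance(a, b) -> int:
    """Top-down tree edit distance (assumed unit costs).

    Relabeling a root costs 1; inserting or deleting a child subtree
    costs its node count; matched children are compared recursively.
    """
    root_cost = 0 if a[0] == b[0] else 1
    xs, ys = a[1], b[1]
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        dp[i][0] = dp[i - 1][0] + size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        dp[0][j] = dp[0][j - 1] + size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(xs[i - 1]),            # delete subtree
                dp[i][j - 1] + size(ys[j - 1]),            # insert subtree
                dp[i - 1][j - 1] + tree_distance(xs[i - 1], ys[j - 1]),
            )
    return root_cost + dp[-1][-1]


def similarity_score(prediction: str, ground_truth: str) -> float:
    """Map the edit distance to a 0-100 score (assumed normalization)."""
    a, b = to_tree(prediction), to_tree(ground_truth)
    distance = tree_distance(a, b)
    return 100.0 * max(0.0, 1.0 - distance / max(size(a), size(b)))


if __name__ == "__main__":
    # A single sign error keeps most of the credit instead of scoring zero.
    print(similarity_score("J*S*(S-1)", "J*S*(S+1)"))
```

Under these assumptions, a prediction that differs from the reference by one operator loses only a fraction of the score, whereas a binary exact-match metric would award nothing. This sketch compares syntax only, so algebraically equivalent expressions written in different forms would still be penalized unless they are simplified to a common form first.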