CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) lack systematic evaluation on advanced theoretical problems in hard sciences—particularly in condensed matter theory (CMT)—despite growing interest in their scientific reasoning capabilities. Method: We introduce CMT-Benchmark, the first expert-curated benchmark for CMT, comprising 50 research-level tasks spanning core quantum many-body and classical statistical mechanics methods—including Hartree–Fock, exact diagonalization, quantum/variational Monte Carlo, and density matrix renormalization group (DMRG). It features a novel procedural symbolic verification framework supporting noncommuting operator normal ordering and rigorous physical dimensionality/symmetry constraint checking. Contribution/Results: Evaluation across 17 state-of-the-art LLMs reveals severe limitations: even GPT-5 solves only 30% of tasks; the mean accuracy is merely 11.4±2.1%; and 18 problems yield zero correct solutions. This constitutes the first systematic evidence of fundamental deficits in LLMs’ ability to reason over physical principles and perform rigorous theoretical physics derivations.

📝 Abstract
Large language models (LLMs) have shown remarkable progress in coding and math problem-solving, but evaluation on advanced research-level problems in hard sciences remains scarce. To fill this gap, we present CMT-Benchmark, a dataset of 50 problems covering condensed matter theory (CMT) at the level of an expert researcher. Topics span analytical and computational approaches in quantum many-body and classical statistical mechanics. The dataset was designed and verified by a panel of expert researchers from around the world. We built the dataset through a collaborative environment that challenges the panel to write and refine problems they would want a research assistant to solve, including Hartree-Fock, exact diagonalization, quantum/variational Monte Carlo, density matrix renormalization group (DMRG), quantum/classical statistical mechanics, and model building. We evaluate LLMs by programmatically checking solutions against expert-supplied ground truth. We developed machine-grading tools, including symbolic handling of non-commuting operators via normal ordering, that generalize across tasks. Our evaluations show that frontier models struggle with the problems in the dataset, highlighting a gap in the physical reasoning skills of current LLMs. Notably, experts identified strategies for creating increasingly difficult problems by interacting with the LLMs and exploiting common failure modes. The best model, GPT-5, solves 30% of the problems; the average across 17 models (GPT, Gemini, Claude, DeepSeek, Llama) is 11.4±2.1%. Moreover, 18 problems are solved by none of the 17 models, and 26 by at most one. These unsolved problems span Quantum Monte Carlo, Variational Monte Carlo, and DMRG. Answers sometimes violate fundamental symmetries or have unphysical scaling dimensions. We believe this benchmark will guide development toward capable AI research assistants and tutors.
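The normal ordering mentioned in the abstract moves every creation operator to the left of every annihilation operator using the commutation relation [a, a†] = 1, so that two equivalent operator expressions reduce to the same canonical form before comparison. A minimal single-mode bosonic sketch of this idea (the benchmark's actual grader is not released; the monomial representation below is purely illustrative):

```python
from collections import defaultdict

# A monomial is a tuple of letters: 'c' = creation a†, 'a' = annihilation a.
# An operator is a dict mapping monomials to numeric coefficients.
# Rewrite rule from [a, a†] = 1:  a a† -> a† a + 1.

def normal_order(op):
    """Rewrite every monomial so all 'c' letters precede all 'a' letters."""
    result = defaultdict(int)
    work = list(op.items())
    while work:
        mono, coeff = work.pop()
        for i in range(len(mono) - 1):
            if mono[i] == 'a' and mono[i + 1] == 'c':
                # a a† -> a† a (swapped) + identity (pair deleted)
                work.append((mono[:i] + ('c', 'a') + mono[i + 2:], coeff))
                work.append((mono[:i] + mono[i + 2:], coeff))
                break
        else:  # already normal-ordered
            result[mono] += coeff
    return {m: c for m, c in result.items() if c != 0}

# a a† a  and  a† a a + a  are the same operator, and canonicalize identically:
lhs = normal_order({('a', 'c', 'a'): 1})
rhs = normal_order({('c', 'a', 'a'): 1, ('a',): 1})
print(lhs == rhs)  # True
```

A grader built on this canonical form can accept any algebraically equivalent rewriting of the expert's ground-truth expression, rather than demanding an exact string match.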
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on advanced condensed matter theory research problems
Assessing physical reasoning skills of AI models in quantum mechanics
Identifying gaps in AI capabilities for scientific problem-solving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Built expert-verified condensed matter theory benchmark
Developed machine grading with symbolic operator handling
Evaluated LLMs against unsolved research-level problems
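The abstract notes that model answers sometimes violate fundamental symmetries; such violations can be caught programmatically by checking that a candidate Hamiltonian commutes with the relevant conserved quantity. A small illustrative check, assuming a two-site spin-1/2 Heisenberg model and total-S^z conservation (not the benchmark's actual verifier):

```python
import numpy as np

# Spin-1/2 operators (hbar = 1).
sx = np.array([[0, 1], [1, 0]]) / 2
sy = np.array([[0, -1j], [1j, 0]]) / 2
sz = np.array([[1, 0], [0, -1]]) / 2
I2 = np.eye(2)

def kron(*ops):
    """Tensor product of single-site operators over the two-site chain."""
    out = np.array([[1.0 + 0j]])
    for op in ops:
        out = np.kron(out, op)
    return out

# Candidate Hamiltonian: H = S1 . S2 (Heisenberg exchange).
H = kron(sx, sx) + kron(sy, sy) + kron(sz, sz)

# Conserved quantity: total S^z.
Sz_tot = kron(sz, I2) + kron(I2, sz)

# Symmetry test: [H, Sz_tot] must vanish.
commutator = H @ Sz_tot - Sz_tot @ H
print(np.allclose(commutator, 0))  # True: H conserves total S^z
```

An answer whose Hamiltonian fails this commutator test can be rejected automatically, without a human grader inspecting the derivation.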
👥 Authors
Haining Pan
Department of Physics and Astronomy, Rutgers University
James V. Roggeveen
School of Engineering and Applied Sciences, Harvard University
Erez Berg
Weizmann Institute of Science
Theoretical condensed matter physics
Juan Carrasquilla
Department of Physics, ETH Zürich
Debanjan Chowdhury
Rosevear Assistant Professor, Physics Department, Cornell University
Non-Fermi liquids, Spin liquids, Gapless phases, Quantum transport, Driven quantum matter
Surya Ganguli
Associate Professor, Stanford University
Neuroscience, Physics, Machine Learning
Federico Ghimenti
Department of Applied Physics, Stanford University
Juraj Hasik
Department of Physics, University of Zürich
Henry Hunt
Department of Applied Physics, Stanford University
Hong-Chen Jiang
Stanford Institute for Materials and Energy Sciences, SLAC National Accelerator Laboratory
Mason Kamb
Department of Applied Physics, Stanford University
Ying-Jer Kao
Professor of Physics, National Taiwan University
Condensed matter physics
Ehsan Khatami
Department of Physics and Astronomy, San José State University
Michael J. Lawler
Department of Physics, Cornell University
Di Luo
Department of Electrical and Computer Engineering, University of California, Los Angeles
Titus Neupert
University of Zurich
Theoretical Condensed Matter Physics
Xiaoliang Qi
Department of Physics, Stanford University
Michael P. Brenner
Harvard University
Eun-Ah Kim
Professor of Physics, Cornell University