Expert Evaluation of LLM World Models: A High-$T_c$ Superconductivity Case Study

📅 2025-11-05
🤖 AI Summary
This study evaluates the accuracy and comprehensiveness of large language models (LLMs) in answering complex, domain-specific questions about high-temperature copper-oxide superconductors, probing their capacity for expert-level understanding of the scientific literature. Method: We construct a domain-specific knowledge base of 1,726 peer-reviewed papers and curate a rigorous benchmark of 67 deep, multi-faceted questions. We propose an expert-designed, multidimensional evaluation framework assessing balance, factual completeness, conciseness, and evidential support, and we develop two retrieval-augmented generation (RAG) systems, including one with multimodal (text-and-figure) retrieval. Contribution/Results: The RAG-based systems significantly outperform closed-source baseline LLMs in factual coverage and evidence grounding, revealing both the emergent potential and the critical limitations of current LLMs in scientific reasoning, particularly in domain-specific inference, citation fidelity, and multimodal evidence integration.

📝 Abstract
Large Language Models (LLMs) show great promise as a powerful tool for scientific literature exploration. However, their effectiveness in providing scientifically accurate and comprehensive answers to complex questions within specialized domains remains an active area of research. Using the field of high-temperature cuprates as an exemplar, we evaluate the ability of LLM systems to understand the literature at the level of an expert. We construct an expert-curated database of 1,726 scientific papers that covers the history of the field, and a set of 67 expert-formulated questions that probe deep understanding of the literature. We then evaluate six different LLM-based systems for answering these questions, including both commercially available closed models and a custom retrieval-augmented generation (RAG) system capable of retrieving images alongside text. Experts then evaluate the answers of these systems against a rubric that assesses balanced perspectives, factual comprehensiveness, succinctness, and evidentiary support. Among the six systems, the two using RAG on curated literature outperformed existing closed models across key metrics, particularly in providing comprehensive and well-supported answers. We discuss promising aspects of LLM performance as well as critical shortcomings of all the models. The set of expert-formulated questions and the rubric will be valuable for assessing expert-level performance of LLM-based reasoning systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM accuracy on specialized scientific domain knowledge
Assessing expert-level literature comprehension, using cuprate superconductivity as a case study
Developing a methodology to measure LLM performance on expert-curated questions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated six LLM-based systems on expert-formulated questions
Used retrieval-augmented generation (RAG) over a curated literature corpus
Assessed answers with a multi-criterion expert rubric
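The retrieval-augmented approach above can be sketched as follows. This is a minimal, illustrative toy, not the authors' implementation: the bag-of-words "embedding", the function names, and the sample corpus are all assumptions standing in for a real embedding model and the paper's 1,726-paper knowledge base.

```python
# Minimal RAG sketch: retrieve the most relevant passages from a curated
# corpus, then assemble a grounded prompt for an LLM. Illustrative only.
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a learned embedding: a word-count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=2):
    # Rank corpus passages by similarity to the query; keep the top k.
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d["text"])),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    # Cite retrieved passages so the answer can be checked for the
    # rubric's "evidentiary support" dimension.
    context = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return f"Answer using only the cited sources.\n{context}\nQ: {question}"

corpus = [
    {"id": "P1", "text": "pseudogap phase in underdoped cuprates"},
    {"id": "P2", "text": "d-wave pairing symmetry of cuprate superconductors"},
    {"id": "P3", "text": "iron pnictide superconductors"},
]
prompt = build_prompt("What is the pairing symmetry of cuprates?",
                      retrieve("pairing symmetry cuprates", corpus))
print(prompt)
```

A production system would replace `embed` with a dense embedding model and the list scan with a vector index, and the multimodal variant would additionally retrieve figures alongside text passages.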
Haoyu Guo (Shanghai AI Lab)
Maria Tikhanovskaya (Google, USA; Department of Physics, Harvard University, USA)
Paul Raccuglia (Google)
Alexey Vlaskin (Google, USA)
Chris Co (Google, USA)
Daniel J. Liebling (Google, USA)
Scott Ellsworth (Google, USA)
Matthew Abraham (Google, USA)
Elizabeth Dorfman (Google, USA)
N. P. Armitage (William H. Miller III Department of Physics and Astronomy, The Johns Hopkins University, Baltimore, MD, USA)
Chunhan Feng (Center for Computational Quantum Physics, Flatiron Institute, USA)
Antoine Georges (Center for Computational Quantum Physics, Flatiron Institute, USA; Collège de France, Paris, France; CPHT, CNRS, École Polytechnique, IP Paris, France; DQMP, Université de Genève, Switzerland)
Olivier Gingras (Center for Computational Quantum Physics, Flatiron Institute, USA; Université Paris-Saclay, CNRS, CEA, Institut de physique théorique, France)
Dominik Kiese (Center for Computational Quantum Physics, Flatiron Institute, USA)
S. Kivelson (Department of Physics, Stanford University, USA)
Vadim Oganesyan (Physics Program and Initiative for the Theoretical Sciences, CUNY, USA; Department of Physics and Astronomy, College of Staten Island, CUNY, USA)
B. J. Ramshaw (Department of Physics, Cornell University, USA)
Subir Sachdev (Department of Physics, Harvard University, USA)
T. Senthil (Department of Physics, Massachusetts Institute of Technology, USA)
J. Tranquada (Condensed Matter Physics and Materials Science Division, Brookhaven National Laboratory, USA)
Michael P. Brenner (Harvard University)
Subhashini Venugopalan (University of Texas at Austin)
Eun-Ah Kim (Professor of Physics, Cornell University)