Quantum-Audit: Evaluating the Reasoning Limits of LLMs on Quantum Computing

📅 2026-02-10
🤖 AI Summary
This study addresses the absence of a systematic benchmark for evaluating large language models’ (LLMs’) conceptual understanding and reasoning capabilities in quantum computing. The authors construct a comprehensive evaluation dataset comprising 2,700 questions spanning core quantum computing topics, integrating expert-authored items, questions extracted from research papers, and problems containing false premises to assess both comprehension and critical reasoning. The dataset is developed through a hybrid approach combining expert knowledge, LLM-assisted generation, and manual validation, yielding multiple question types—including multiple-choice, open-ended, and premise-fault reasoning tasks. Experimental results show that Claude Opus 4.5 achieves an accuracy of 84%, surpassing the average human expert performance of 74%; however, its performance declines on expert-crafted questions and advanced security-related topics, and most models struggle to identify and correct erroneous premises.

📝 Abstract
Language models have become practical tools for quantum computing education and research, from summarizing technical papers to explaining theoretical concepts and answering questions about recent developments in the field. While existing benchmarks evaluate quantum code generation and circuit design, models' conceptual understanding of quantum computing has not been systematically measured. Quantum-Audit addresses this gap with 2,700 questions covering core quantum computing topics, on which we evaluate 26 models from leading organizations. The benchmark comprises 1,000 expert-written questions; 1,000 questions extracted from research papers using LLMs and validated by experts; and 700 additional questions, split into 350 open-ended questions and 350 questions with false premises that test whether models can correct erroneous assumptions. Human participants scored between 23% and 86%, with experts averaging 74%. Top-performing models exceeded the expert average, with Claude Opus 4.5 reaching 84% accuracy, though top models showed an average 12-point accuracy drop on expert-written questions compared to LLM-generated ones. Performance declined further on advanced topics, dropping to 73% on security questions. Additionally, models frequently accepted and reinforced false premises embedded in questions instead of identifying them, with accuracy below 66% on these critical reasoning tasks.
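To make the false-premise task concrete, here is a minimal hypothetical sketch (not the authors' code or data) of how such an item might be represented and scored; the field names, the sample question, and the keyword-based check are all illustrative assumptions.

```python
# Hypothetical false-premise benchmark item: the question builds on a
# claim that is physically wrong (measurement preserving superposition).
FALSE_PREMISE_ITEM = {
    "question": (
        "Since measuring a qubit in the computational basis leaves its "
        "superposition intact, how can we reuse it for error correction?"
    ),
    "premise_is_false": True,  # in reality, measurement collapses the state
}

def flags_false_premise(answer: str) -> bool:
    """Credit an answer only if it challenges the faulty assumption
    instead of building on it (crude keyword heuristic for illustration)."""
    markers = (
        "premise is incorrect",
        "this assumption is wrong",
        "measurement collapses",
        "measurement destroys",
    )
    return any(m in answer.lower() for m in markers)

good = "The premise is incorrect: measurement collapses the superposition."
bad = "You can reuse it by applying a corrective gate after measurement."

print(flags_false_premise(good))  # True: the answer rejects the premise
print(flags_false_premise(bad))   # False: the answer reinforces it
```

A real grader would use an LLM judge or expert rubric rather than keywords; the point is only that false-premise items are scored on whether the model pushes back, not on whether it produces a fluent answer.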
Problem

Research questions and friction points this paper is trying to address.

quantum computing
large language models
reasoning evaluation
concept understanding
false premise detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantum-Audit
reasoning evaluation
large language models
quantum computing
false premise detection