SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

📅 2025-02-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks inadequately assess large language models' (LLMs) knowledge and high-order reasoning capabilities across specialized domains, particularly in light industry, agriculture, and service-oriented fields. Method: We introduce SuperGPQA, the first comprehensive, broad-spectrum benchmark for professional-domain evaluation, spanning 285 graduate-level disciplines. Our approach features a novel human-LLM collaborative filtering mechanism that integrates expert feedback and LLM responses through iterative refinement, together with cross-disciplinary crowdsourced annotation, structured domain partitioning, and difficulty stratification to ensure question quality and reliability. Contribution/Results: Experiments reveal that even state-of-the-art models (e.g., DeepSeek-R1) achieve only 61.82% accuracy, exposing systemic gaps in professional-domain reasoning. The benchmark incorporates rigorous annotation by 80+ domain experts, establishing a new paradigm and foundational infrastructure for evaluating general artificial intelligence capabilities.

📝 Abstract
Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-oriented disciplines, remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs in specialized disciplines
Developing a comprehensive benchmark for 285 fields
Assessing graduate-level knowledge and reasoning capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-LLM collaborative filtering
285 graduate disciplines benchmark
Iterative refinement with expert feedback
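The collaborative filtering idea in the bullets above can be sketched as a simple loop: pool several LLMs' answers for each question, route items that look trivial or ambiguous back to a domain expert for revision, and re-check until the question survives or a round budget is exhausted. This is a minimal sketch under stated assumptions; `query_llms`, `expert_revise`, the flagging thresholds, and the drop-after-`max_rounds` policy are hypothetical illustrations, not the paper's actual pipeline.

```python
# Hypothetical sketch of an iterative human-LLM collaborative filtering loop.
# The function names, thresholds, and drop policy are illustrative assumptions.

def query_llms(question, models):
    """Collect each model's chosen option for a multiple-choice question."""
    return [model(question) for model in models]

def filter_questions(questions, models, expert_revise, max_rounds=3):
    """Keep questions that are neither trivial nor obviously ambiguous.

    Flagged questions go back to a domain expert for revision and are
    re-checked; questions still failing after max_rounds are dropped.
    """
    kept = []
    for q in questions:
        for _ in range(max_rounds):
            answers = query_llms(q["text"], models)
            n_correct = sum(a == q["answer"] for a in answers)
            if n_correct == len(models):
                # Every model solves it: likely trivial, ask the expert to harden it.
                q = expert_revise(q, reason="too easy")
            elif n_correct == 0 and len(set(answers)) == len(models):
                # No model is right and all disagree: possibly ambiguous wording.
                q = expert_revise(q, reason="possibly ambiguous")
            else:
                kept.append(q)
                break
    return kept
```

In this toy version, a question survives as soon as some but not all models answer it correctly, a crude stand-in for the paper's difficulty stratification and expert-feedback criteria.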
👥 Authors
M-A-P Team · ByteDance Inc., 2077.AI
Xinrun Du · Multimodal Art Projection Research Community, 01.ai · LLM
Yifan Yao · Drexel University
Kaijing Ma · Fudan University · Computer Vision, Machine Learning
Bingli Wang
Tianyu Zheng · M-A-P & TikTok Researcher · LLM
Kang Zhu
Minghao Liu
Yiming Liang · Institute of Automation, Chinese Academy of Sciences (CASIA), M-A-P · LLM
Xiaolong Jin · Purdue University · AI Safety
Zhenlin Wei
Chujie Zheng · Qwen Team, Alibaba Group · Artificial Intelligence, Large Language Models
Kaixin Deng
Shuyue Guo
Shian Jia
Sichao Jiang
Yiyan Liao
Rui Li
Qinrui Li
Sirun Li
Yizhi Li · University of Manchester, M-A-P · LLM, Reasoning, Post-training, Computational Music
Yunwen Li · CUHK(SZ)
Dehua Ma
Yuansheng Ni · University of Waterloo · Artificial Intelligence, Natural Language Processing, Large Language Models
Haoran Que · Beihang University
Qiyao Wang · Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences · Natural Language Processing, Large Language Models, Agentic AI, Patent Processing, AI for IP
Zhoufutu Wen · ByteDance SEED · LLM Evaluation
Siwei Wu · University of Manchester · Large Language Models, Natural Language Processing, Commonsense Reasoning
Tianshun Xing
Ming Xu
Zhenzhu Yang
Zekun Moore Wang · KlingAI at Kuaishou Technology · Multimodal, Natural Language Processing, Large Language Models, Generative AI
Junting Zhou · Peking University · Large Language Models, AI for Science, Bioinformatics
Yuelin Bai
Xingyuan Bu
Chenglin Cai
Liang Chen
Yifan Chen
Chengtuo Cheng
Tianhao Cheng · Fudan University · Large Language Models
Keyi Ding
Siming Huang
Yun Huang
Yaoru Li · Zhejiang University, Huawei Technologies · LLM Agents
Yizhe Li
Zhaoqun Li · Zhejiang University · AI
Tianhao Liang
Chengdong Lin
Hongquan Lin
Yinghao Ma · PhD candidate, Centre for Digital Music (C4DM), Queen Mary University of London · Music Information Retrieval, Large Language Models, Multimodal Learning, Audio Signal Processing
Zhongyuan Peng · Fudan University · LLM
Zifan Peng · Ph.D. Candidate at HKUST(GZ) · DeFi, Trustworthy AI
Qige Qi
Shi Qiu
Xingwei Qu
Yizhou Tan
Zili Wang · StepFun LLM Researcher & M-A-P · Large Language Models, Code Intelligence
Chenqing Wang
Hao Wang
Yiya Wang
Yubo Wang
Jiajun Xu
Kexin Yang
Ru-Qing Yuan
Yuanhao Yue · Fudan University · LLM, NLP, Instruction Tuning, Data Synthesis, Factuality
Tianyang Zhan
Chun Zhang
Jingyang Zhang
Xiyue Zhang · University of Bristol · Formal Methods, Artificial Intelligence, Trustworthy AI
Xingjian Zhang
Yue Zhang
Yongchi Zhao
Xiangyu Zheng
Chenghua Zhong
Yang Gao
Zhoujun Li · Beihang University · Artificial Intelligence, Natural Language Processing, Network Security
Dayiheng Liu
Qian Liu
Tianyu Liu
Shiwen Ni
Junran Peng · Associate Professor at USTB · 3D AIGC, 3D Comprehension and Reconstruction, Embodied AI
Yujia Qin · ByteDance · Agent
Wenbo Su
Guoyin Wang
Shi Wang · Institute of Computing Technology · Knowledge Graph, Natural Language Processing, Neural-Symbolic Dual-Process Computing
Jian Yang
Min Yang · ByteDance · Vision Language Model, Computer Vision, Video Understanding
Meng Cao · Postdoc, Carnegie Mellon University · Psychology
Xiang Yue · Carnegie Mellon University · Natural Language Processing, Large Language Models, Machine Learning
Zhaoxiang Zhang · Institute of Automation, Chinese Academy of Sciences · Computer Vision, Pattern Recognition, Biologically-inspired Learning
Wangchunshu Zhou · OPPO & M-A-P · Artificial General Intelligence, Language Agents, Large Language Models, Natural Language Processing
Jiaheng Liu
Qunshu Lin · Co-Founder of Abaka.AI · Data-Centric AI
Wenhao Huang
Ge Zhang