🤖 AI Summary
Problem: Existing benchmarks inadequately assess the knowledge and higher-order reasoning capabilities of large language models (LLMs) in specialized domains, particularly light industry, agriculture, and service-oriented fields, leaving most of the 285 graduate-level disciplines of human knowledge under-evaluated.
Method: We introduce SuperGPQA, the first comprehensive, broad-spectrum benchmark for professional-domain evaluation. It combines a novel human-LLM collaborative filtering mechanism, which iteratively refines questions based on expert feedback and LLM responses, with cross-disciplinary crowdsourced annotation, structured domain partitioning, and difficulty stratification to ensure question quality and reliability.
Contribution/Results: Experiments show that even the best-performing state-of-the-art model, DeepSeek-R1, reaches only 61.82% accuracy, exposing systemic gaps in professional-domain reasoning. Built on rigorous annotation by more than 80 domain experts, the benchmark establishes a new evaluation paradigm and foundational infrastructure for measuring progress toward artificial general intelligence.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-oriented disciplines, remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.
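To make the collaborative filtering loop described above concrete, here is a minimal sketch in Python. The callables `query_llms` (answers from a panel of screening models) and `revise` (an expert edits the flagged question, or returns `None` to drop it) are hypothetical names for illustration, and the trivial/ambiguous criteria are assumptions, not the paper's published thresholds.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Question:
    text: str
    options: list[str]
    answer: str  # correct option label, e.g. "B"

def filter_questions(
    questions: list[Question],
    query_llms: Callable[[Question], list[str]],       # hypothetical: one answer per screening LLM
    revise: Callable[[Question, str], Optional[Question]],  # hypothetical: expert revises or drops
    max_rounds: int = 3,
) -> list[Question]:
    """Iteratively drop trivial questions and route ambiguous ones to experts.

    Heuristics (assumptions, not the paper's exact criteria):
      - trivial:   every screening LLM answers correctly
      - ambiguous: no majority answer among the screening LLMs
    """
    kept: list[Question] = []
    for q in questions:
        current: Optional[Question] = q
        for _ in range(max_rounds):
            answers = query_llms(current)
            if not answers:                      # no screening signal; keep as-is
                break
            n_correct = sum(a == current.answer for a in answers)
            top_votes = max(answers.count(a) for a in set(answers))
            if n_correct == len(answers):        # all models solve it: too easy
                current = revise(current, "trivial")
            elif top_votes <= len(answers) // 2: # no majority: unclear stem or options
                current = revise(current, "ambiguous")
            else:                                # discriminative enough; keep
                break
            if current is None:                  # expert chose to discard
                break
        if current is not None:
            kept.append(current)
    return kept
```

In practice, the `revise` step would route the flagged question, together with the attached LLM answers, into the interactive annotation system the abstract describes, so that experts see why a question was flagged before rewriting or discarding it.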