ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing academic evaluation benchmarks inadequately assess higher-order reasoning, particularly cross-disciplinary integration and comprehension of top-tier journal-level knowledge. To address this gap, we introduce AcadReason: the first benchmark explicitly designed to evaluate higher-order academic reasoning across multiple domains. It comprises 50 challenging, expert-annotated questions drawn from recent papers in leading journals across computer science, economics, law, mathematics, and philosophy. Rigorous quality control ensures annotation fidelity and task difficulty. The benchmark supports standardized evaluation under mainstream reasoning paradigms, including zero-shot and chain-of-thought prompting. We evaluate over ten state-of-the-art large language models and autonomous agent systems. Results reveal profound limitations: even the strongest model, GPT-5, scores only 16/100, and the best-performing agent stays below 40/100. These findings expose fundamental gaps in current AI systems' capacity for advanced academic research tasks.

📝 Abstract
In recent years, the research focus of large language models (LLMs) and agents has shifted increasingly from demonstrating novel capabilities to complex reasoning and tackling challenging tasks. However, existing evaluations focus mainly on math/code contests or general tasks, while existing multi-domain academic benchmarks lack sufficient reasoning depth, leaving the field without a rigorous benchmark for high-level reasoning. To fill this gap, we introduce the AcadReason benchmark, designed to evaluate the ability of LLMs and agents to acquire and reason over academic knowledge. It consists of 50 expert-annotated academic problems across five high-reasoning domains: computer science, economics, law, mathematics, and philosophy. All questions are sourced from top-tier publications in recent years and undergo rigorous annotation and quality control to ensure they are both challenging and answerable. We conduct systematic evaluations of over 10 mainstream LLMs and agents. The results show that most LLMs score below 20 points, with even the cutting-edge GPT-5 achieving only 16 points. While agents achieve higher scores, none exceed 40 points. This demonstrates the current capability gap between LLMs and agents on super-intelligent academic research tasks and highlights the challenge posed by AcadReason.
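The evaluation protocol described above (the same questions scored under zero-shot and chain-of-thought prompting, on a 0-100 scale) can be sketched roughly as follows. This is a minimal illustration, not the paper's actual harness: `query_model`, the prompt templates, and the exact-match scoring rule are all assumptions made for the sketch.

```python
# Minimal sketch of a zero-shot vs. chain-of-thought evaluation loop.
# `query_model` is a hypothetical stand-in for a real LLM API call,
# stubbed here so the sketch runs on its own.

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call (assumption, not a real API)."""
    return "ANSWER: 42"  # a real harness would send `prompt` to a model here

def build_prompt(question: str, paradigm: str) -> str:
    """Wrap a benchmark question in one of the two prompting paradigms."""
    if paradigm == "zero-shot":
        return f"{question}\nAnswer directly."
    if paradigm == "cot":
        return f"{question}\nLet's think step by step, then give the final answer."
    raise ValueError(f"unknown paradigm: {paradigm}")

def score(questions, gold_answers, paradigm: str) -> float:
    """Exact-match accuracy scaled to 0-100, mirroring the 0-100 scale above."""
    correct = 0
    for question, gold in zip(questions, gold_answers):
        reply = query_model(build_prompt(question, paradigm))
        # Assume the model is instructed to end with "ANSWER: <final answer>".
        answer = reply.rsplit("ANSWER:", 1)[-1].strip()
        correct += int(answer == gold)
    return 100.0 * correct / len(questions)
```

In practice, expert-annotated benchmarks like this one typically use more nuanced scoring than exact match (e.g. rubric-based grading), which the single accuracy number here glosses over.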
Problem

Research questions and friction points this paper is trying to address.

Evaluating reasoning abilities of LLMs on academic knowledge
Addressing lack of rigorous benchmarks for high-level reasoning
Testing models across five complex domains with expert problems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing Acadreason benchmark for academic reasoning evaluation
Evaluating LLMs and agents across five high-reasoning domains
Demonstrating capability gaps in super-intelligent academic tasks
👥 Authors
Xin Gui (OPPO AI Agent Team)
King Zhu (OPPO AI Agent Team)
JinCheng Ren (OPPO AI Agent Team)
Qianben Chen (OPPO AI Agent Team)
Zekun Moore Wang (KlingAI at Kuaishou Technology; Multimodal, Natural Language Processing, Large Language Models, Generative AI)
Yizhi LI (OPPO AI Agent Team)
Xinpeng Liu (Shanghai Jiao Tong University; Human Motion Understanding, Embodied AI, Digital Human)
Xiaowan Li (OPPO AI Agent Team)
Wenli Ren (OPPO AI Agent Team)
Linyu Miao (OPPO AI Agent Team)
Tianrui Qin (OPPO; Agentic AI, Deep Learning, LLM Security)
Ziqi Shu (OPPO AI Agent Team)
He Zhu (OPPO AI Agent Team)
Xiangru Tang (OPPO AI Agent Team)
Dingfeng Shi (OPPO; Video Analysis, Agentic LLM)
Jiaheng Liu (OPPO AI Agent Team)
Yuchen Eleanor Jiang (OPPO; natural language processing, machine learning)
Minghao Liu (OPPO AI Agent Team)
Ge Zhang (OPPO AI Agent Team)
Wangchunshu Zhou (OPPO & M-A-P; artificial general intelligence, language agents, large language models, natural language processing)