ELAIPBench: A Benchmark for Expert-Level Artificial Intelligence Paper Understanding

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately assess large language models' (LLMs) deep comprehension and complex reasoning over full academic papers. To address this, we introduce ELAIPBench, the first expert-curated, AI-domain-specific benchmark for paper understanding, comprising 137 papers and 403 multiple-choice questions organized into three difficulty levels and emphasizing reasoning over retrieval. Its novelty lies in an incentive-driven adversarial annotation framework that ensures question quality, together with human-expert-annotated ground truth for reliable evaluation. We systematically evaluate state-of-the-art LLMs with chain-of-thought prompting and retrieval-augmented generation (RAG). Results reveal that even the strongest model achieves only 39.95% accuracy, far below human performance, and that adding either a thinking mode or retrieval fails to improve results and can even degrade them. This constitutes the first systematic demonstration of fundamental limitations in current LLMs' expert-level academic understanding.
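
The headline metric is plain multiple-choice accuracy over the 403 questions. Below is a minimal sketch of such a scoring loop, assuming a hypothetical JSONL release where each record carries `paper_text`, `question`, `options`, and `answer` fields, and a stubbed `ask_model` call standing in for whichever LLM is being evaluated; none of these names come from the paper or its released artifacts.

```python
import json
import re
import string

def ask_model(prompt: str) -> str:
    """Stub for an LLM call; replace with a real model or API client.
    It trivially answers 'A' so the script runs end to end."""
    return "A"

def build_prompt(record: dict) -> str:
    """Format the full paper text, question, and lettered options into one prompt."""
    letters = string.ascii_uppercase
    options = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(record["options"]))
    return (
        f"Paper:\n{record['paper_text']}\n\n"
        f"Question: {record['question']}\n{options}\n"
        "Answer with a single letter."
    )

def extract_choice(reply: str) -> str | None:
    """Pull the first standalone capital letter out of the model's reply."""
    match = re.search(r"\b([A-J])\b", reply)
    return match.group(1) if match else None

def evaluate(path: str) -> float:
    """Compute multiple-choice accuracy over a JSONL file of benchmark records."""
    correct = total = 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            prediction = extract_choice(ask_model(build_prompt(record)))
            correct += prediction == record["answer"]
            total += 1
    return correct / total if total else 0.0

if __name__ == "__main__":
    print(f"accuracy = {evaluate('elaipbench.jsonl'):.2%}")  # hypothetical file name
```

The chain-of-thought and RAG conditions reported in the paper would amount to variations of `build_prompt` and `ask_model` in this sketch, not changes to the scoring itself.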

📝 Abstract
While large language models (LLMs) excel at many domain-specific tasks, their ability to deeply comprehend and reason about full-length academic papers remains underexplored. Existing benchmarks often fall short of capturing such depth, either due to surface-level question design or unreliable evaluation metrics. To address this gap, we introduce ELAIPBench, a benchmark curated by domain experts to evaluate LLMs' comprehension of artificial intelligence (AI) research papers. Developed through an incentive-driven, adversarial annotation process, ELAIPBench features 403 multiple-choice questions from 137 papers. It spans three difficulty levels and emphasizes non-trivial reasoning rather than shallow retrieval. Our experiments show that the best-performing LLM achieves an accuracy of only 39.95%, far below human performance. Moreover, we observe that frontier LLMs equipped with a thinking mode or a retrieval-augmented generation (RAG) system fail to improve final results, even harming accuracy due to overthinking or noisy retrieval. These findings underscore the significant gap between current LLM capabilities and genuine comprehension of academic papers.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' deep comprehension of full-length academic papers
Assessing reasoning beyond surface-level retrieval in AI research
Benchmarking expert-level understanding through an adversarial annotation process
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expert-curated benchmark for AI paper comprehension
Adversarial annotation process with incentive-driven design
Multi-difficulty questions emphasizing non-trivial reasoning