🤖 AI Summary
Existing benchmarks inadequately assess large language models' (LLMs) deep comprehension and complex reasoning over full academic papers. To address this, we introduce ELAIPBench, the first expert-curated, AI-domain-specific benchmark for paper understanding, comprising 403 multiple-choice questions drawn from 137 papers, organized into three difficulty levels and emphasizing reasoning over retrieval. Its novelty lies in an incentive-driven adversarial annotation framework that ensures question quality and a human-expert-annotated ground truth for reliable evaluation. We systematically evaluate state-of-the-art LLMs using chain-of-thought prompting and retrieval-augmented generation. Results reveal that even the strongest model achieves only 39.95% accuracy, substantially below human performance, and that adding either reasoning or retrieval enhancements degrades performance. This constitutes the first systematic demonstration of fundamental limitations in current LLMs' expert-level academic understanding.
📝 Abstract
While large language models (LLMs) excel at many domain-specific tasks, their ability to deeply comprehend and reason about full-length academic papers remains underexplored. Existing benchmarks often fall short of capturing such depth, either due to surface-level question design or unreliable evaluation metrics. To address this gap, we introduce ELAIPBench, a benchmark curated by domain experts to evaluate LLMs' comprehension of artificial intelligence (AI) research papers. Developed through an incentive-driven, adversarial annotation process, ELAIPBench features 403 multiple-choice questions from 137 papers. It spans three difficulty levels and emphasizes non-trivial reasoning rather than shallow retrieval. Our experiments show that the best-performing LLM achieves an accuracy of only 39.95%, far below human performance. Moreover, we observe that frontier LLMs equipped with a thinking mode or a retrieval-augmented generation (RAG) system fail to improve final results and can even harm accuracy due to overthinking or noisy retrieval. These findings underscore the significant gap between current LLM capabilities and genuine comprehension of academic papers.
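To make the evaluation setup concrete, below is a minimal sketch of how accuracy could be computed on a multiple-choice benchmark like ELAIPBench. The field names and the `ask_model` helper are illustrative assumptions, not the paper's released code or data schema.

```python
# Hypothetical evaluation loop for a multiple-choice paper-understanding benchmark.
# Dataset fields and ask_model() are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class Question:
    paper_text: str           # full text of the source paper
    prompt: str               # question stem
    options: dict[str, str]   # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    answer: str               # expert-annotated ground-truth letter
    difficulty: str           # one of three difficulty levels

def ask_model(question: Question) -> str:
    """Placeholder: send the paper, question, and options to an LLM
    (optionally with chain-of-thought prompting or RAG over the paper)
    and parse the single answer letter it returns."""
    raise NotImplementedError

def accuracy(questions: list[Question]) -> float:
    """Fraction of questions whose predicted letter matches the ground truth."""
    correct = sum(ask_model(q) == q.answer for q in questions)
    return correct / len(questions)
```

For reference, an accuracy of 39.95% on 403 questions corresponds to roughly 161 correct answers (161 / 403 ≈ 0.3995).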