🤖 AI Summary
Existing benchmarks inadequately assess large language models' (LLMs) deep comprehension and complex reasoning over full academic papers. To address this, we introduce ELAIPBench, the first expert-curated, AI-domain-specific benchmark for paper understanding, comprising 403 multiple-choice questions drawn from 137 papers, organized into three difficulty levels and emphasizing reasoning over retrieval. Its novelty lies in an incentive-driven adversarial annotation framework that ensures question quality and a human-expert-annotated ground truth for reliable evaluation. We systematically evaluate state-of-the-art LLMs using chain-of-thought prompting and retrieval-augmented generation. Results reveal that even the strongest model achieves only 39.95% accuracy, substantially below human performance, and that adding either reasoning or retrieval enhancements degrades performance. This constitutes the first systematic demonstration of fundamental limitations in current LLMs' expert-level academic understanding.
📝 Abstract
While large language models (LLMs) excel at many domain-specific tasks, their ability to deeply comprehend and reason about full-length academic papers remains underexplored. Existing benchmarks often fall short of capturing such depth, either due to surface-level question design or unreliable evaluation metrics. To address this gap, we introduce ELAIPBench, a benchmark curated by domain experts to evaluate LLMs' comprehension of artificial intelligence (AI) research papers. Developed through an incentive-driven, adversarial annotation process, ELAIPBench features 403 multiple-choice questions from 137 papers. It spans three difficulty levels and emphasizes non-trivial reasoning rather than shallow retrieval. Our experiments show that the best-performing LLM achieves an accuracy of only 39.95%, far below human performance. Moreover, we observe that frontier LLMs equipped with a thinking mode or a retrieval-augmented generation (RAG) system fail to improve final results and can even harm accuracy due to overthinking or noisy retrieval. These findings underscore the significant gap between current LLM capabilities and genuine comprehension of academic papers.
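To make the evaluation setup concrete, below is a minimal sketch of how accuracy could be computed on a multiple-choice benchmark like ELAIPBench. The field names and the `ask_model` helper are illustrative assumptions, not the paper's released code or data schema.

```python
# Hypothetical evaluation loop for a multiple-choice paper-understanding benchmark.
# Dataset fields and ask_model() are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class Question:
    paper_text: str           # full text of the source paper
    prompt: str               # question stem
    options: dict[str, str]   # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    answer: str               # expert-annotated ground-truth letter
    difficulty: str           # one of three difficulty levels

def ask_model(question: Question) -> str:
    """Placeholder: send the paper, question, and options to an LLM
    (optionally with chain-of-thought prompting or RAG over the paper)
    and parse the single answer letter it returns."""
    raise NotImplementedError

def accuracy(questions: list[Question]) -> float:
    """Fraction of questions whose predicted letter matches the ground truth."""
    correct = sum(ask_model(q) == q.answer for q in questions)
    return correct / len(questions)
```

For reference, an accuracy of 39.95% on 403 questions corresponds to roughly 161 correct answers (161 / 403 ≈ 0.3995).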