🤖 AI Summary
This work addresses the challenge of evaluating large language models (LLMs) on long scientific document understanding. We introduce SCALAR, a benchmark for long-context scientific reasoning grounded in academic citation networks. Methodologically, we propose a citation-driven mechanism that generates ground-truth labels automatically, together with a dynamic update framework that avoids data contamination, enabling controllable, multi-granular evaluation across context lengths and reasoning types. Our approach integrates scholarly document structure parsing, citation graph modeling, self-supervised label synthesis, and context-sensitive prompt engineering. Evaluating eight state-of-the-art LLMs on ICLR 2025 submissions, we systematically uncover critical bottlenecks, particularly in factual traceability, logical deduction, and cross-paragraph integration, and characterize each model's capability distribution. SCALAR establishes a scalable, reproducible paradigm for rigorous, citation-aware assessment of scientific reasoning over extended contexts.
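To make the ground-truth mechanism concrete, here is a minimal sketch of citation-masking question generation, assuming the benchmark hides an in-text citation and asks the model to recover the cited reference from the document's own bibliography. The data shapes and function names below are hypothetical illustrations, not the authors' released code:

```python
import random
from dataclasses import dataclass

@dataclass
class Citation:
    """One in-text citation and the reference it resolves to."""
    marker: str     # e.g. "(Smith et al., 2023)" as it appears in the text
    ref_title: str  # title of the cited paper (the ground-truth label)

def make_item(paper_text: str, citations: list[Citation],
              num_distractors: int = 3) -> dict:
    """Mask one citation and build a multiple-choice item whose answer
    comes from the citation graph itself, with no human annotation."""
    target = random.choice(citations)
    # Hide the first occurrence of the citation marker in the paper.
    masked = paper_text.replace(target.marker, "[MASKED CITATION]", 1)
    # Distractors are other references from the same paper, which keeps
    # items hard without making them ambiguous.
    pool = [c.ref_title for c in citations if c is not target]
    options = random.sample(pool, k=min(num_distractors, len(pool)))
    options.append(target.ref_title)
    random.shuffle(options)
    return {
        "context": masked,
        "question": "Which paper does [MASKED CITATION] refer to?",
        "options": options,
        "answer": target.ref_title,
    }
```

Because the correct answer comes from the paper's own reference list, labels are generated without human annotation, and drawing distractors from the same bibliography controls difficulty without introducing ambiguity.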
📝 Abstract
Evaluating the long-context understanding capabilities of large language models (LLMs) remains challenging. We present SCALAR (Scientific Citation-based Live Assessment of Long-context Academic Reasoning), a novel benchmark that leverages academic papers and their citation networks. SCALAR features automatic generation of high-quality ground truth labels without human annotation, controllable difficulty levels, and a dynamic updating mechanism that prevents data contamination. Using ICLR 2025 papers, we evaluate eight state-of-the-art LLMs, revealing key insights about their capabilities and limitations in processing long scientific documents across different context lengths and reasoning types. Our benchmark provides a reliable and sustainable way to track progress in long-context understanding as LLM capabilities evolve.
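The dynamic-updating claim suggests a simple contamination guard: only admit papers that postdate every evaluated model's training cutoff. A hedged sketch of that filter follows; the cutoff values and record fields are illustrative assumptions, not details from the paper:

```python
from datetime import date

# Hypothetical training cutoffs for the evaluated models
# (illustrative values only).
MODEL_CUTOFFS = {
    "model-a": date(2024, 4, 1),
    "model-b": date(2024, 8, 1),
}

def fresh_papers(papers: list[dict]) -> list[dict]:
    """Keep only papers newer than every model's training cutoff,
    so no evaluated model can have seen them during training."""
    latest_cutoff = max(MODEL_CUTOFFS.values())
    return [p for p in papers if p["published"] > latest_cutoff]
```

Re-running such a filter on each new batch of submissions, for example each ICLR cycle, would keep the benchmark live as models and their training cutoffs advance.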