🤖 AI Summary
This work addresses the challenge of evaluating large language models (LLMs) on long scientific document understanding. We introduce SCALAR, a benchmark for long-context scientific reasoning grounded in academic citation networks. Methodologically, we propose a citation-driven mechanism that generates ground-truth labels automatically, together with a dynamic update framework that avoids data contamination, enabling controllable, multi-granular evaluation across context lengths and reasoning types. Our approach integrates scholarly document structure parsing, citation graph modeling, self-supervised label synthesis, and context-sensitive prompt engineering. Evaluating eight state-of-the-art LLMs on ICLR 2025 submissions, we systematically uncover critical bottlenecks, particularly in factual traceability, logical deduction, and cross-paragraph integration, and characterize each model's capability distribution. SCALAR establishes a scalable, reproducible paradigm for rigorous, citation-aware assessment of scientific reasoning over extended contexts.
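To make the ground-truth mechanism concrete, here is a minimal sketch of citation-masking question generation, assuming the benchmark hides an in-text citation and asks the model to recover the cited reference from the document's own bibliography. The data shapes and function names below are hypothetical illustrations, not the authors' released code:

```python
import random
from dataclasses import dataclass

@dataclass
class Citation:
    """One in-text citation and the reference it resolves to."""
    marker: str     # e.g. "(Smith et al., 2023)" as it appears in the text
    ref_title: str  # title of the cited paper (the ground-truth label)

def make_item(paper_text: str, citations: list[Citation],
              num_distractors: int = 3) -> dict:
    """Mask one citation and build a multiple-choice item whose answer
    comes from the citation graph itself, with no human annotation."""
    target = random.choice(citations)
    # Hide the first occurrence of the citation marker in the paper.
    masked = paper_text.replace(target.marker, "[MASKED CITATION]", 1)
    # Distractors are other references from the same paper, which keeps
    # items hard without making them ambiguous.
    pool = [c.ref_title for c in citations if c is not target]
    options = random.sample(pool, k=min(num_distractors, len(pool)))
    options.append(target.ref_title)
    random.shuffle(options)
    return {
        "context": masked,
        "question": "Which paper does [MASKED CITATION] refer to?",
        "options": options,
        "answer": target.ref_title,
    }
```

Because the correct answer comes from the paper's own reference list, labels are generated without human annotation, and drawing distractors from the same bibliography controls difficulty without introducing ambiguity.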
📝 Abstract
Evaluating the long-context understanding capabilities of large language models (LLMs) remains challenging. We present SCALAR (Scientific Citation-based Live Assessment of Long-context Academic Reasoning), a novel benchmark that leverages academic papers and their citation networks. SCALAR features automatic generation of high-quality ground truth labels without human annotation, controllable difficulty levels, and a dynamic updating mechanism that prevents data contamination. Using ICLR 2025 papers, we evaluate eight state-of-the-art LLMs, revealing key insights about their capabilities and limitations in processing long scientific documents across different context lengths and reasoning types. Our benchmark provides a reliable and sustainable way to track progress in long-context understanding as LLM capabilities evolve.
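The dynamic-updating claim suggests a simple contamination guard: only admit papers that postdate every evaluated model's training cutoff. A hedged sketch of that filter follows; the cutoff values and record fields are illustrative assumptions, not details from the paper:

```python
from datetime import date

# Hypothetical training cutoffs for the evaluated models
# (illustrative values only).
MODEL_CUTOFFS = {
    "model-a": date(2024, 4, 1),
    "model-b": date(2024, 8, 1),
}

def fresh_papers(papers: list[dict]) -> list[dict]:
    """Keep only papers newer than every model's training cutoff,
    so no evaluated model can have seen them during training."""
    latest_cutoff = max(MODEL_CUTOFFS.values())
    return [p for p in papers if p["published"] > latest_cutoff]
```

Re-running such a filter on each new batch of submissions, for example each ICLR cycle, would keep the benchmark live as models and their training cutoffs advance.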