SCALAR: Scientific Citation-based Live Assessment of Long-context Academic Reasoning

📅 2025-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of evaluating large language models (LLMs) on long scientific document understanding. We introduce SCALAR, a benchmark for long-context scientific reasoning grounded in academic citation networks. Methodologically, we propose a citation-driven automatic ground-truth generation mechanism and a dynamic update framework that avoids data contamination, enabling controllable, multi-granular evaluation across context lengths and reasoning types. Our approach integrates scholarly document structure parsing, citation graph modeling, self-supervised label synthesis, and context-sensitive prompt engineering. Evaluating eight state-of-the-art LLMs on ICLR 2025 submissions, we systematically uncover critical bottlenecks, particularly in factual traceability, logical deduction, and cross-paragraph integration, and characterize their capability distributions. SCALAR establishes a scalable, reproducible paradigm for rigorous, citation-aware scientific reasoning assessment over extended contexts.

📝 Abstract
Evaluating large language models' (LLMs) long-context understanding capabilities remains challenging. We present SCALAR (Scientific Citation-based Live Assessment of Long-context Academic Reasoning), a novel benchmark that leverages academic papers and their citation networks. SCALAR features automatic generation of high-quality ground truth labels without human annotation, controllable difficulty levels, and a dynamic updating mechanism that prevents data contamination. Using ICLR 2025 papers, we evaluate 8 state-of-the-art LLMs, revealing key insights about their capabilities and limitations in processing long scientific documents across different context lengths and reasoning types. Our benchmark provides a reliable and sustainable way to track progress in long-context understanding as LLM capabilities evolve.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' long-context understanding
Automatic generation of ground truth labels
Evaluating LLMs on scientific documents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages citation networks for assessment
Automatic ground truth label generation
Dynamic updating prevents data contamination
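The innovations above rest on one core idea: citation links give free, verifiable labels. A minimal sketch of how such citation-masking questions might be built, assuming the common formulation where one in-text citation marker is masked and the cited reference serves as the automatically derived ground truth (function and data names here are illustrative, not the paper's actual code):

```python
# Hypothetical sketch of citation-based ground-truth generation.
# A citation marker in the paper text is masked; the cited reference
# becomes the answer label, with no human annotation needed.

def make_citation_question(paper_text: str, citation: str, mask: str = "[MASKED]"):
    """Mask the first occurrence of `citation` in `paper_text`.

    Returns (masked_text, answer): the model must recover which
    reference the masked marker pointed to, using the long context.
    """
    if citation not in paper_text:
        raise ValueError(f"citation {citation!r} not found in text")
    masked_text = paper_text.replace(citation, mask, 1)
    return masked_text, citation

# Toy example (invented sentence, not from the paper):
text = "Transformers [3] dominate NLP; we build on [7] for retrieval."
masked, answer = make_citation_question(text, "[7]")
```

Because new papers keep appearing (e.g. each ICLR cycle), the question pool can be regenerated from fresh submissions, which is one plausible reading of how the dynamic update mechanism sidesteps contamination.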
Authors
Renxi Wang (MBZUAI) — Natural Language Processing
Honglin Mu (MBZUAI, LibrAI)
Liqun Ma (MBZUAI)
Lizhi Lin (LibrAI, Tsinghua University)
Yunlong Feng (HIT-SCIR) — NLP
Timothy Baldwin (MBZUAI and The University of Melbourne) — computational linguistics, natural language processing, artificial intelligence
Xudong Han (MBZUAI, LibrAI)
Haonan Li (MBZUAI, LibrAI)