Who Gets Cited Most? Benchmarking Long-Context Language Models on Scientific Articles

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing long-context benchmarks predominantly rely on non-scientific or synthetically generated texts, which limits their ability to evaluate how well large language models (LLMs) reason over long spans of authentic scientific literature. To address this, the paper introduces SciTrek, a long-context question-answering benchmark built from full-text scientific articles. SciTrek builds a database from article metadata (titles, authors, and references) and automatically generates verifiable, cross-document reasoning questions as SQL queries, so the benchmark scales to contexts of up to 1M tokens with minimal supervision. Its key contributions are: (1) a fine-grained, error-analyzable long-context QA benchmark designed specifically for scientific texts; and (2) multi-document aggregation and numerical reasoning tasks grounded in real citation data. Experiments show that state-of-the-art LLMs perform poorly on SciTrek as context length grows, with supervised fine-tuning and reinforcement learning yielding only limited gains. The analysis points to two bottlenecks: locating specific information in long contexts and performing basic numerical operations.

📝 Abstract
This paper introduces SciTrek, a novel question-answering benchmark designed to evaluate the long-context reasoning capabilities of large language models (LLMs) using scientific articles. Current long-context benchmarks often rely on non-scientific texts, focus on simple information retrieval tasks, or employ artificial contexts. SciTrek addresses these limitations by proposing complex questions that require information aggregation and synthesis across multiple full-text scientific articles. Questions and their ground-truth answers are automatically generated by formulating them as SQL queries over a database constructed from article metadata (titles, authors, and references). The SQL operations provide explicit, verifiable reasoning steps for fine-grained error analysis, and the construction process scales to contexts up to 1M tokens with minimal supervision. Extensive experiments on a diverse set of open-weight and proprietary LLMs demonstrate that SciTrek poses a significant challenge as the context length increases, with supervised fine-tuning and reinforcement learning offering only limited gains. Our analysis reveals systematic shortcomings in models' abilities to perform basic numerical operations and accurately locate specific information in long contexts.
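
The abstract describes questions and ground-truth answers being derived from SQL queries over a metadata database. The following is a minimal sketch of that idea, not the authors' implementation: the schema (`articles`, `authors`, `refs`), the helper names, the example question, and the toy data are all assumptions made for illustration.

```python
import sqlite3


def build_db(articles):
    """Load article metadata into an in-memory SQLite database.

    The three-table schema is hypothetical; the source only states that the
    database is built from titles, authors, and references.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE authors (article_id INTEGER, name TEXT);
        CREATE TABLE refs (citing_id INTEGER, cited_id INTEGER);
        """
    )
    for art in articles:
        conn.execute("INSERT INTO articles VALUES (?, ?)", (art["id"], art["title"]))
        conn.executemany(
            "INSERT INTO authors VALUES (?, ?)",
            [(art["id"], name) for name in art["authors"]],
        )
        conn.executemany(
            "INSERT INTO refs VALUES (?, ?)",
            [(art["id"], cited) for cited in art["cites"]],
        )
    return conn


# A natural-language question paired with the SQL query that answers it.
QUESTION = "Which article in the collection is cited most often by the others?"
QUERY = """
    SELECT a.title, COUNT(*) AS n_citations
    FROM refs AS r
    JOIN articles AS a ON a.id = r.cited_id
    GROUP BY a.id
    ORDER BY n_citations DESC, a.title ASC
    LIMIT 1
"""


def ground_truth(conn):
    """Execute the query; its result serves as the verifiable answer."""
    return conn.execute(QUERY).fetchone()


# Toy collection (invented data): article 1 cites 2 and 3, article 2 cites 3.
toy_articles = [
    {"id": 1, "title": "Paper A", "authors": ["Ann", "Bob"], "cites": [2, 3]},
    {"id": 2, "title": "Paper B", "authors": ["Cy"], "cites": [3]},
    {"id": 3, "title": "Paper C", "authors": ["Ann"], "cites": []},
]
conn = build_db(toy_articles)
answer = ground_truth(conn)
print(QUESTION, "->", answer)
```

Because the answer comes from executing the query, each SQL operation (join, group, count, sort) is an explicit step that a model's response could in principle be checked against.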

Problem

Research questions and friction points this paper is trying to address.

Evaluating long-context reasoning in LLMs using scientific articles
Addressing limitations of non-scientific benchmarks with simple tasks
Assessing information aggregation across multiple full-text research papers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated benchmark generation using SQL queries
Complex questions requiring multi-article information synthesis
Scalable construction process supporting million-token contexts
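
One way to picture how a context might be grown toward a target length is greedy packing of full-text articles under a token budget. The sketch below is an assumption made for illustration only: the source does not describe how contexts are assembled, and `pack_context`, the whitespace token count, and the toy documents are all hypothetical.

```python
def pack_context(articles, token_budget):
    """Greedily concatenate article texts until the budget would be exceeded.

    Token counts use whitespace splitting of the body text as a rough proxy;
    a real setup would use the target model's tokenizer and count titles too.
    """
    parts, used = [], 0
    for art in articles:
        n_tokens = len(art["text"].split())
        if used + n_tokens > token_budget:
            break  # stop at the first article that does not fit
        parts.append(f"Title: {art['title']}\n{art['text']}")
        used += n_tokens
    return "\n\n".join(parts), used


# Toy documents (invented data) with 3, 2, and 4 body tokens.
docs = [
    {"title": "Paper A", "text": "alpha beta gamma"},
    {"title": "Paper B", "text": "delta epsilon"},
    {"title": "Paper C", "text": "zeta eta theta iota"},
]
context, used = pack_context(docs, token_budget=6)
```

Raising `token_budget` admits more articles into the same context, so the same collection can yield contexts of different lengths.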

Authors
Miao Li
School of Informatics, The University of Edinburgh
Alexander Gurung
School of Informatics, The University of Edinburgh
Irina Saparina
School of Informatics, The University of Edinburgh
Mirella Lapata
School of Informatics, Edinburgh University
natural language processing