🤖 AI Summary
Existing long-context benchmarks predominantly rely on non-scientific or synthetically generated texts, limiting their ability to evaluate large language models’ (LLMs) long-range reasoning over authentic scientific literature. To address this, we introduce SciTrek, the first long-context question-answering benchmark built exclusively from full-text scientific papers. SciTrek constructs a metadata-enriched, citation-aware literature database and automatically generates verifiable cross-document reasoning questions via SQL queries, enabling scaling to million-token contexts with minimal supervision. Its key contributions are: (1) a long-context QA benchmark specifically designed for scientific texts that supports fine-grained error analysis; and (2) multi-document aggregation and numerical reasoning tasks grounded in real citation networks. Experiments reveal severe performance limitations of state-of-the-art LLMs on SciTrek, with only modest gains from fine-tuning and reinforcement learning, highlighting two critical bottlenecks: information localization and numerical computation.
📝 Abstract
This paper introduces SciTrek, a novel question-answering benchmark designed to evaluate the long-context reasoning capabilities of large language models (LLMs) using scientific articles. Current long-context benchmarks often rely on non-scientific texts, focus on simple information retrieval tasks, or employ artificial contexts. SciTrek addresses these limitations by proposing complex questions that require information aggregation and synthesis across multiple full-text scientific articles. Questions and their ground-truth answers are automatically generated by formulating them as SQL queries over a database constructed from article metadata (titles, authors, and references). The SQL operations provide explicit, verifiable reasoning steps for fine-grained error analysis, and the construction process scales to contexts up to 1M tokens with minimal supervision. Extensive experiments on a diverse set of open-weight and proprietary LLMs demonstrate that SciTrek poses a significant challenge as the context length increases, with supervised fine-tuning and reinforcement learning offering only limited gains. Our analysis reveals systematic shortcomings in models' abilities to perform basic numerical operations and accurately locate specific information in long contexts.
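The abstract's central mechanism, posing questions as SQL queries over a metadata database so that ground-truth answers are automatically computable and verifiable, can be sketched as follows. The schema, table names, and question template below are illustrative assumptions for a toy citation network, not SciTrek's actual implementation.

```python
import sqlite3

# Toy metadata database: papers, authorship, and a small citation graph.
# (Hypothetical schema; SciTrek's real database is built from full-text
# scientific articles and their extracted metadata.)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE papers(id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE authors(paper_id INTEGER, name TEXT);
CREATE TABLE citations(citing_id INTEGER, cited_id INTEGER);
""")
cur.executemany("INSERT INTO papers VALUES (?, ?)",
                [(1, "Paper A"), (2, "Paper B"), (3, "Paper C")])
cur.executemany("INSERT INTO authors VALUES (?, ?)",
                [(1, "Lee"), (2, "Lee"), (2, "Kim"), (3, "Kim")])
cur.executemany("INSERT INTO citations VALUES (?, ?)",
                [(2, 1), (3, 1), (3, 2)])

# A natural-language question paired with the SQL query that produces its
# verifiable ground-truth answer; the query's operations (JOIN, COUNT)
# double as explicit reasoning steps for error analysis.
question = "How many of the provided papers cite 'Paper A'?"
sql = """
SELECT COUNT(DISTINCT citing_id)
FROM citations JOIN papers ON papers.id = citations.cited_id
WHERE papers.title = 'Paper A'
"""
ground_truth = cur.execute(sql).fetchone()[0]
print(question, "->", ground_truth)  # -> 2 (Paper B and Paper C cite it)
```

Because the answer is derived mechanically from the database rather than annotated by hand, this style of construction scales to very long contexts with minimal supervision, as the abstract describes.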