CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited ability of current large language models to perform global reasoning across entire document corpora within million-token contexts, a gap left unaddressed by existing benchmarks that focus on single long documents or rely on sparse retrieval. To this end, the authors propose CorpusQA—the first synthetic benchmark designed specifically for corpus-level reasoning—comprising up to 10 million tokens of unstructured text. Leveraging a programmatic framework that decouples reasoning from text representation, CorpusQA generates complex queries requiring global integration, cross-document comparison, and statistical aggregation, with verifiable answers. Experiments reveal that mainstream long-context models suffer significant performance degradation as input length increases, and that conventional retrieval-augmented approaches fail to scale effectively. In contrast, memory-augmented agent architectures demonstrate superior capability in synthesizing global information, underscoring both the validity and the difficulty of the benchmark.
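The summary's core idea, decoupling reasoning from text representation, can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: facts are first generated as structured records, each record is rendered into its own short document, and the ground-truth answer to an aggregation query is computed over the structured records rather than the text, so it is correct by construction. All names (`Company-…`, the revenue query) are hypothetical.

```python
import random

random.seed(0)

# 1. Structured layer: one record per hypothetical company.
records = [
    {"name": f"Company-{i}", "revenue": random.randint(10, 500)}
    for i in range(200)
]

# 2. Textual layer: each record becomes a separate unstructured document,
#    so evidence for any aggregate query is dispersed across the corpus.
corpus = [
    f"{r['name']} is a firm whose annual revenue reached {r['revenue']}M USD."
    for r in records
]

# 3. Query with a programmatically guaranteed answer: the aggregation runs
#    over the structured records, never over the rendered text.
question = "How many companies in the corpus report revenue above 250M USD?"
answer = sum(1 for r in records if r["revenue"] > 250)

print(len(corpus), "documents |", question, "->", answer)
```

Because the answer is derived from the structured layer, the benchmark never depends on human annotation, and evidence dispersion can be scaled arbitrarily by adding records or inflating each document with distractor text.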

📝 Abstract
While large language models now handle million-token contexts, their capacity for reasoning across entire document repositories remains largely untested. Existing benchmarks are inadequate, as they are mostly limited to single long texts or rely on a "sparse retrieval" assumption: that answers can be derived from a few relevant chunks. This assumption fails for true corpus-level analysis, where evidence is highly dispersed across hundreds of documents and answers require global integration, comparison, and statistical aggregation. To address this critical gap, we introduce CorpusQA, a new benchmark scaling up to 10 million tokens, generated via a novel data synthesis framework. By decoupling reasoning from textual representation, this framework creates complex, computation-intensive queries with programmatically guaranteed ground-truth answers, challenging systems to perform holistic reasoning over vast, unstructured text without relying on fallible human annotation. We further demonstrate the utility of our framework beyond evaluation, showing that fine-tuning on our synthesized data effectively enhances an LLM's general long-context reasoning capabilities. Extensive experiments reveal that even state-of-the-art long-context LLMs struggle as input length increases, and standard retrieval-augmented generation systems collapse entirely. Our findings indicate that memory-augmented agentic architectures offer a more robust alternative, suggesting a critical shift is needed from simply extending context windows to developing advanced architectures for global information synthesis.
Problem

Research questions and friction points this paper is trying to address.

corpus-level reasoning
long-context language models
evidence dispersion
global information synthesis
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

CorpusQA
corpus-level reasoning
synthetic data generation
long-context LLMs
memory-augmented agents