🤖 AI Summary
This work addresses the challenge of evaluating large language models' global comprehension and deep reasoning over long texts. We propose PRELUDE, a benchmark that assesses models by asking whether a character's prequel narrative is consistent with the canonical storyline of the original book — a task requiring cross-paragraph semantic integration, aggregation of indirect evidence, and logical consistency inference. PRELUDE introduces a prequel consistency verification paradigm; 88% of its instances require reasoning over multiple spans of the narrative, substantially raising the difficulty of coherence modeling. Experiments show that state-of-the-art LLMs trail humans by more than 30 percentage points in reasoning accuracy and by more than 15 points overall, and these gaps persist across in-context learning, retrieval-augmented generation (RAG), in-domain fine-tuning, and commercial DeepResearch services — exposing fundamental limitations in deep semantic integration and logical validation, and establishing PRELUDE as a demanding testbed for studying long-context reasoning.
📝 Abstract
We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks -- as the prequels are not part of the original story, assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG, and in-domain training with state-of-the-art LLMs, as well as commercial DeepResearch services, all lag behind humans by >15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to a gap of over 30% in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.