🤖 AI Summary
This work addresses the challenge of evaluating large language models' global comprehension and deep reasoning over long texts. We propose PRELUDE, a benchmark that assesses models by asking whether a character's prequel narrative is consistent with the canonical storyline of the original book — a task requiring cross-paragraph semantic integration, aggregation of indirect evidence, and logical consistency inference. PRELUDE introduces a prequel consistency verification paradigm; 88% of its instances require reasoning over multiple spans of the narrative, substantially raising the difficulty of coherence modeling. Experiments show that state-of-the-art LLMs trail humans by more than 30 percentage points in reasoning accuracy and by more than 15 points overall, and these gaps persist across in-context learning, retrieval-augmented generation (RAG), in-domain fine-tuning, and commercial DeepResearch services — exposing fundamental limitations in deep semantic integration and logical validation, and establishing PRELUDE as a demanding testbed for studying long-context reasoning.
📝 Abstract
We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks -- as the prequels are not part of the original story, assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG, and in-domain training with state-of-the-art LLMs, as well as commercial DeepResearch services, all lag behind humans by >15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to a gap of over 30% in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.