NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multi-hop reasoning (k = 1–4) in large language models (LLMs) over long contexts (64K–128K tokens) lacks systematic evaluation: mainstream benchmarks assess context length or hop count in isolation and neglect natural narrative coherence. Method: We introduce the first multi-hop question-answering benchmark grounded in full-length novels, featuring keyword-guided hop-by-hop chain construction, oracle-context filtering, and human-validated hop alignment to jointly control context length and reasoning depth within authentic narrative settings. Contribution/Results: Experiments reveal substantial performance degradation in state-of-the-art models as hop count and context length increase; dominant failure modes include missed final-hop integration and long-range semantic drift. The benchmark establishes a reproducible, diagnosable evaluation paradigm for long-context multi-hop reasoning, enabling fine-grained analysis of model capabilities across both dimensions in realistic scenarios.

📝 Abstract
Current large language models (LLMs) struggle to answer questions that span tens of thousands of tokens, especially when multi-hop reasoning is involved. While prior benchmarks explore long-context comprehension or multi-hop reasoning in isolation, none jointly vary context length and reasoning depth in natural narrative settings. We introduce NovelHopQA, the first benchmark to evaluate k = 1–4 hop QA over 64k–128k-token excerpts from 83 full-length public-domain novels. A keyword-guided pipeline builds hop-separated chains grounded in coherent storylines. We evaluate six state-of-the-art (SOTA) models and apply oracle-context filtering to ensure all questions are genuinely answerable. Human annotators validate both alignment and hop depth. We observe consistent accuracy drops with increased hops and context length, even in frontier models, revealing that sheer scale does not guarantee robust reasoning. Our failure mode analysis highlights common breakdowns, such as missed final-hop integration and long-range drift. NovelHopQA offers a controlled diagnostic setting to stress-test multi-hop reasoning at scale.
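The keyword-guided, hop-by-hop chain construction could be sketched roughly as follows. This is an illustrative assumption, not the authors' actual pipeline: the greedy linking strategy, the `(text, bridging keyword)` passage representation, and the function name are all hypothetical.

```python
def build_hop_chain(passages, start_keyword, k):
    """Greedily link up to k passages: each hop's passage must mention
    the bridging keyword produced by the previous hop.

    passages: list of (text, next_bridging_keyword) pairs (hypothetical format).
    Returns the indices of the chained passages, in hop order.
    """
    chain = []
    used = set()
    keyword = start_keyword
    for _ in range(k):
        # Find an unused passage that mentions the current bridging keyword.
        nxt = next(
            (i for i, (text, _bridge) in enumerate(passages)
             if i not in used and keyword in text),
            None,
        )
        if nxt is None:
            break  # chain cannot be extended to the requested depth
        used.add(nxt)
        chain.append(nxt)
        keyword = passages[nxt][1]  # bridging keyword for the next hop

    return chain


# Toy example: three passages linked by bridging keywords.
passages = [
    ("The locket belonged to Marian.", "Marian"),
    ("Marian hid in the boathouse.", "boathouse"),
    ("The boathouse burned that night.", "fire"),
]
print(build_hop_chain(passages, "locket", 3))  # → [0, 1, 2]
```

A real pipeline would also need the oracle-context filtering and human hop-depth validation the paper describes; this sketch only shows why keyword bridging yields hop-separated, storyline-grounded chains.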
Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-hop reasoning in long narrative contexts
Assessing LLM performance on 64k-128k token excerpts
Diagnosing accuracy drops with increased hops and context
Innovation

Methods, ideas, or system contributions that make the work stand out.

Keyword-guided pipeline builds hop-separated chains
Oracle-context filtering ensures answerable questions
Retrieval-augmented generation tests model performance