NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens

📅 2024-03-18

📈 Citations: 3

✨ Influential: 0

🤖 AI Summary

Existing long-context benchmarks do not adequately assess how deeply large language models (LLMs) comprehend complex narratives. To address this, the authors introduce NovelQA, the first English novel question-answering benchmark explicitly designed for ultra-long texts (>200K tokens). It targets three core challenges: multi-hop reasoning, fine-grained detail localization, and robustness under extreme context lengths. NovelQA is built entirely through human annotation and covers factual, inferential, and abstract question types while preserving narrative coherence and semantic complexity. Empirical evaluation shows that state-of-the-art long-context LLMs have substantial limitations in multi-hop reasoning and precise detail retrieval, underperforming human annotators by 32% in average accuracy, which points to bottlenecks in long-range semantic modeling. NovelQA provides a demanding evaluation framework for assessing long-context understanding in LLMs.

📝 Abstract

Recent advancements in Large Language Models (LLMs) have pushed the boundaries of natural language processing, especially in long-context understanding. However, the evaluation of these models' long-context abilities remains a challenge due to the limitations of current benchmarks. To address this gap, we introduce NovelQA, a benchmark tailored for evaluating LLMs with complex, extended narratives. Constructed from English novels, NovelQA offers a unique blend of complexity, length, and narrative coherence, making it an ideal tool for assessing deep textual understanding in LLMs. This paper details the design and construction of NovelQA, focusing on its comprehensive manual annotation process and the variety of question types aimed at evaluating nuanced comprehension. Our evaluation of long-context LLMs on NovelQA reveals significant insights into their strengths and weaknesses. Notably, the models struggle with multi-hop reasoning, detail-oriented questions, and handling extremely long inputs, with average lengths exceeding 200,000 tokens. Results highlight the need for substantial advancements in LLMs to enhance their long-context comprehension and contribute effectively to computational literary analysis.

Problem

Research questions and friction points this paper is trying to address.

Evaluating long-context understanding in LLMs, which current benchmarks measure inadequately

Assessing LLMs' performance on complex, extended narratives over 200K tokens

Identifying LLMs' weaknesses in multi-hop reasoning and detail-oriented questions

Innovation

Methods, ideas, or system contributions that make the work stand out.

NovelQA benchmark for long-context LLMs

Manual annotation for nuanced comprehension

Evaluation of 200K+ token documents
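
As a rough illustration of how results on a benchmark of this kind can be broken down by question type, the following minimal sketch computes per-type and overall accuracy. The record fields (`type`, `gold`, `pred`), the case-insensitive exact-match rule, and the toy records are all assumptions made for this example; they are not NovelQA's actual data format or scoring procedure.

```python
from collections import defaultdict


def is_correct(record):
    """Hypothetical scoring rule: case-insensitive exact match after trimming."""
    return record["pred"].strip().lower() == record["gold"].strip().lower()


def accuracy_by_type(records):
    """Return {question_type: accuracy} for a list of answer records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["type"]] += 1
        if is_correct(record):
            correct[record["type"]] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


def overall_accuracy(records):
    """Return the fraction of records answered correctly (0.0 if empty)."""
    if not records:
        return 0.0
    return sum(is_correct(r) for r in records) / len(records)


# Made-up records for demonstration only; not taken from the benchmark.
toy_records = [
    {"type": "multi-hop", "gold": "B", "pred": "A"},
    {"type": "multi-hop", "gold": "C", "pred": "C"},
    {"type": "detail", "gold": "A", "pred": "a "},
    {"type": "detail", "gold": "D", "pred": "B"},
]

if __name__ == "__main__":
    print(accuracy_by_type(toy_records))
    print(overall_accuracy(toy_records))
```

Reporting accuracy per question type, rather than a single aggregate, is what makes weaknesses such as multi-hop reasoning or detail retrieval visible in the first place.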

🔎 Similar Papers

DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels