LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing narrative long-document QA benchmarks (e.g., NarrativeQA) suffer from noisy documents and low-quality question-answer pairs, undermining evaluation reliability. Method: We introduce LiteraryQA, a rigorously curated high-quality subset of NarrativeQA, built via a human-LLM collaborative verification pipeline for data cleaning. We systematically evaluate automatic metrics, analyzing their correlation with human judgments across diverse long-context LLMs. Contribution/Results: We find that conventional n-gram metrics correlate weakly with human assessments, whereas lightweight open-weight LLMs (e.g., Qwen2-7B) used as judges achieve system-level ranking consistency with human annotators. We publicly release the filtered dataset and an integrated evaluation framework, and conduct comprehensive benchmarking across multiple long-context LLMs. This work provides the first empirical validation of LLM-as-a-Judge for narrative QA evaluation, demonstrating both its effectiveness and computational efficiency.
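The "system-level ranking consistency" finding above can be illustrated with a small, self-contained sketch: rank QA systems by their mean human score and by each automatic metric, then measure rank agreement with Kendall's tau. All scores below are made-up illustration values, not numbers from the paper.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall rank correlation between two equal-length score lists."""
    assert len(x) == len(y)
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical per-system scores for four QA systems:
human = [0.82, 0.74, 0.61, 0.55]  # mean human judgment per system
ngram = [0.30, 0.41, 0.28, 0.35]  # an n-gram metric scrambles the ranking
judge = [0.80, 0.71, 0.63, 0.52]  # an LLM judge preserves the ranking

print(kendall_tau(human, ngram))  # 0.0 -> no rank agreement
print(kendall_tau(human, judge))  # 1.0 -> identical system ranking
```

A metric can have low per-answer agreement with humans yet still rank systems correctly; the paper's meta-evaluation targets this system-level view.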

📝 Abstract
Question Answering (QA) on narrative text poses a unique challenge to current systems, requiring a deep understanding of long, complex documents. However, the reliability of NarrativeQA, the most widely used benchmark in this domain, is hindered by noisy documents and flawed QA pairs. In this work, we introduce LiteraryQA, a high-quality subset of NarrativeQA focused on literary works. Using a human- and LLM-validated pipeline, we identify and correct low-quality QA samples while removing extraneous text from source documents. We then carry out a meta-evaluation of automatic metrics to clarify how systems should be evaluated on LiteraryQA. This analysis reveals that all n-gram-based metrics have a low system-level correlation to human judgment, while LLM-as-a-Judge evaluations, even with small open-weight models, can strongly agree with the ranking identified by humans. Finally, we benchmark a set of long-context LLMs on LiteraryQA. We release our code and data at https://github.com/SapienzaNLP/LiteraryQA.
Problem

Research questions and friction points this paper is trying to address.

The NarrativeQA benchmark is unreliable: its source documents are noisy and many of its QA pairs are flawed
How to build a high-quality subset (LiteraryQA) through a human- and LLM-validated cleaning pipeline
Which automatic metrics reliably evaluate narrative QA, and how current long-context LLMs perform on it
Innovation

Methods, ideas, or system contributions that make the work stand out.

Built a human- and LLM-validated pipeline for cleaning QA pairs and source documents
Meta-evaluated automatic metrics, including LLM-as-a-Judge, against human system rankings
Benchmarked a set of long-context LLMs on the curated literary dataset
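A minimal LLM-as-a-Judge scoring loop can be sketched as follows. This is an assumption-laden illustration, not the paper's actual prompt or the released framework's API: `call_llm` is a hypothetical chat-completion callable, and the 1-5 scale is a generic choice.

```python
# Hypothetical judge prompt; the paper's real prompt may differ.
JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Gold answer: {gold}
System answer: {prediction}
Reply with a single integer score from 1 (wrong) to 5 (fully correct)."""

def judge_score(call_llm, question, gold, prediction):
    """Ask the judge model for a 1-5 correctness score and parse the reply."""
    reply = call_llm(JUDGE_PROMPT.format(
        question=question, gold=gold, prediction=prediction))
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 1  # fall back to the lowest score

# Stub judge for illustration; a real setup would call a small
# open-weight model of the kind the paper evaluates as judges.
fake_llm = lambda prompt: "Score: 4"
print(judge_score(fake_llm, "Who raises Mowgli?", "The wolves", "A wolf pack"))  # 4
```

Averaging such scores per system, rather than per answer, yields the system-level rankings the paper compares against human annotators.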