Literary Evidence Retrieval via Long-Context Language Models

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the deep literary comprehension capabilities of long-context language models through the task of literary evidence retrieval: given an entire novel (e.g., *The Great Gatsby*) and a passage of literary criticism with its quotation removed, models must locate and generate the missing verbatim passage from the novel. The task jointly demands global narrative reasoning and close-reading analysis. We curate a high signal-to-noise subset of 292 instances from the RELiC dataset (Thai et al., 2022) through extensive filtering and human verification, enabling a systematic evaluation of long-context LLMs on literary close reading. Evaluation employs human-validated ground truth with dual metrics: accuracy and over-generation rate. Results show Gemini Pro 2.5 achieves 62.5% accuracy, surpassing domain-expert human performance (50%), while the best open-weight model attains only 29.1%. All models exhibit consistent weaknesses in interpreting literary signals, particularly metaphor, temporal structure, and context-dependent semantics.
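The dual metrics above can be sketched as follows. Note this is a minimal illustration, not the paper's released evaluation code: the exact matching rule and the over-generation threshold used here are assumptions, and the function name `evaluate` is hypothetical.

```python
def evaluate(prediction: str, ground_truth: str) -> dict:
    """Score a generated quotation against the human-validated ground truth.

    Illustrative criteria (assumptions, not the paper's definitions):
    - correct: the prediction contains the ground-truth passage verbatim
      after whitespace normalization;
    - over_generated: the prediction is correct but emits substantially
      more text than the target (here, more than 2x its length).
    """
    # Normalize whitespace so line breaks in the novel text do not
    # spuriously break verbatim matching.
    norm_pred = " ".join(prediction.split())
    norm_gold = " ".join(ground_truth.split())

    correct = norm_gold in norm_pred
    over_generated = correct and len(norm_pred) > 2 * len(norm_gold)
    return {"correct": correct, "over_generated": over_generated}
```

Accuracy is then the fraction of instances with `correct == True`, and the over-generation rate the fraction of correct predictions flagged `over_generated`.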

📝 Abstract
How well do modern long-context language models understand literary fiction? We explore this question via the task of literary evidence retrieval, repurposing the RELiC dataset of Thai et al. (2022) to construct a benchmark where the entire text of a primary source (e.g., The Great Gatsby) is provided to an LLM alongside literary criticism with a missing quotation from that work. This setting, in which the model must generate the missing quotation, mirrors the human process of literary analysis by requiring models to perform both global narrative reasoning and close textual examination. We curate a high-quality subset of 292 examples through extensive filtering and human verification. Our experiments show that recent reasoning models, such as Gemini Pro 2.5, can exceed human expert performance (62.5% vs. 50% accuracy). In contrast, the best open-weight model achieves only 29.1% accuracy, highlighting a wide gap in interpretive reasoning between open and closed-weight models. Despite their speed and apparent accuracy, even the strongest models struggle with nuanced literary signals and overgeneration, signaling open challenges for applying LLMs to literary analysis. We release our dataset and evaluation code to encourage future work in this direction.
Problem

Research questions and friction points this paper is trying to address.

Evaluating long-context LLMs' understanding of literary fiction
Retrieving missing quotations from literary criticism using full texts
Assessing models' global narrative and close textual reasoning skills
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses long-context language models for evidence retrieval
Benchmarks models on the entire text of a literary work
Evaluates both global narrative reasoning and close reading