The Elephant in the Coreference Room: Resolving Coreference in Full-Length French Fiction Works

📅 2025-10-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Computational literary studies are hindered by the scarcity of annotated long-form literary texts, particularly texts with long, complex coreference chains, which makes robust evaluation of coreference resolution models difficult. To address this, the authors introduce a coreference resolution corpus built over complete French novels: three full-length works totaling over 285,000 tokens. They propose a modular processing pipeline that integrates contextualized representation learning with explicit referential chain modeling and targets long-distance and multi-hop coreference. The approach is competitive, scales to long documents, and supports fine-grained error analysis. It also transfers to a downstream literary NLP task, inferring the gender of fictional characters, which illustrates its practical use in computational literary analysis.

📝 Abstract

While coreference resolution is attracting more interest than ever from computational literature researchers, representative datasets of fully annotated long documents remain surprisingly scarce. In this paper, we introduce a new annotated corpus of three full-length French novels, totaling over 285,000 tokens. Unlike previous datasets focused on shorter texts, our corpus addresses the challenges posed by long, complex literary works, enabling evaluation of coreference models in the context of long reference chains. We present a modular coreference resolution pipeline that allows for fine-grained error analysis. We show that our approach is competitive and scales effectively to long documents. Finally, we demonstrate its usefulness to infer the gender of fictional characters, showcasing its relevance for both literary analysis and downstream NLP tasks.
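To make the gender-inference use case concrete, here is a minimal sketch of how resolved coreference chains could feed such a task. It is not the authors' method: the cue lists, the `infer_gender` function, the `min_margin` threshold, and the example chains are all hypothetical, chosen only to illustrate a majority vote over gendered French pronouns and titles within a chain.

```python
from collections import Counter

# Hypothetical cue lists (assumptions for illustration, not from the paper).
FEMININE_CUES = {"elle", "elle-même", "madame", "mademoiselle"}
MASCULINE_CUES = {"il", "lui-même", "monsieur"}


def infer_gender(chain, min_margin=2):
    """Majority vote over gendered cues found in one coreference chain.

    `chain` is a list of mention strings that a coreference system has
    already grouped as referring to the same character.
    """
    counts = Counter()
    for mention in chain:
        for token in mention.lower().split():
            token = token.strip(",;:!?\"'()")  # drop surrounding punctuation
            if token in FEMININE_CUES:
                counts["F"] += 1
            elif token in MASCULINE_CUES:
                counts["M"] += 1
    margin = counts["F"] - counts["M"]
    if margin >= min_margin:
        return "F"
    if -margin >= min_margin:
        return "M"
    return "unknown"  # too little evidence, or conflicting cues


# Hypothetical chains, as a coreference pipeline might output them.
chains = {
    "Emma": ["Emma", "elle", "Madame Bovary", "elle", "la jeune femme"],
    "Charles": ["Charles", "il", "Monsieur Bovary", "il"],
    "narrator": ["je", "moi"],
}
genders = {name: infer_gender(chain) for name, chain in chains.items()}
print(genders)
```

Longer chains give the vote more evidence, which is one reason document-level resolution matters here: a character introduced by name in an early chapter may only be disambiguated by pronouns much later in the novel.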

Problem

Research questions and friction points this paper is trying to address.

Addressing coreference resolution challenges in lengthy French novels

Providing annotated corpus for evaluating long reference chains

Developing scalable pipeline for literary analysis and gender inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

Annotated corpus of three full-length French novels

Modular coreference resolution pipeline that allows fine-grained error analysis

Scales effectively to long documents with long reference chains
