🤖 AI Summary
This study addresses the challenge of modeling similarity between narrative stories and learning effective narrative representations. The authors organize a shared task (NSNRL) built on a triplet-based binary classification problem grounded in narrative theory and human intuition: given an anchor story, a system must determine which of two candidate stories is more similar to it. To support the task, they construct a dataset of more than 1,000 human-annotated story summary triples. Submitted systems draw on large language model (LLM) ensembles, fine-tuned embedding models, and pre- and post-processing of pretrained embeddings. Across 71 final submissions from 46 teams, LLM ensembles account for many of the top-scoring systems in the classification track, while in the embedding track, pre- and post-processing of pretrained embedding models performs about on par with custom fine-tuned solutions. The analysis suggests that automated systems have headroom for improvement in both tracks.
📝 Abstract
We present the shared task on narrative similarity and narrative representation learning, NSNRL (pronounced "nass-na-rel"). The task operationalizes narrative similarity as a binary classification problem: determining which of two stories is more similar to an anchor story. We introduce a novel definition of narrative similarity, compatible with both narrative theory and intuitive judgment. Based on the similarity judgments collected under this concept, we also evaluate narrative embedding representations. We collected at least two annotations each for more than 1,000 story summary triples, with each annotation backed by at least two annotators in agreement. This paper describes the sampling and annotation process for the dataset; further, we give an overview of the submitted systems and the techniques they employ. We received a total of 71 final submissions from 46 teams across our two tracks. In our triple-based classification setup, LLM ensembles make up many of the top-scoring systems, while in the embedding setup, systems with pre- and post-processing on pretrained embedding models perform about on par with custom fine-tuned solutions. Our analysis identifies potential headroom for improvement of automated systems in both tracks. The task website includes visualizations of embeddings alongside instance-level classification results for all teams.
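To make the triple-based setup concrete, here is a minimal sketch of how an embedding could be scored against triple judgments. The abstract does not specify the scoring procedure, so everything here is an assumption for illustration: the use of cosine similarity, the tie-breaking rule, the label names `"a"` and `"b"`, and the helper names `cosine_similarity`, `predict_closer`, and `triplet_accuracy`.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Triplet = Tuple[Vector, Vector, Vector]  # (anchor, candidate_a, candidate_b)


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_closer(anchor: Vector, cand_a: Vector, cand_b: Vector) -> str:
    """Return "a" if candidate A is at least as similar to the anchor as B, else "b".

    Assumption: ties are resolved in favor of "a"; the source does not say
    how ties should be handled.
    """
    sim_a = cosine_similarity(anchor, cand_a)
    sim_b = cosine_similarity(anchor, cand_b)
    return "a" if sim_a >= sim_b else "b"


def triplet_accuracy(triplets: List[Triplet], labels: List[str]) -> float:
    """Fraction of triples where the embedding-based choice matches the label."""
    if len(triplets) != len(labels):
        raise ValueError("triplets and labels must have the same length")
    if not triplets:
        raise ValueError("need at least one triplet")
    correct = sum(
        predict_closer(anchor, a, b) == label
        for (anchor, a, b), label in zip(triplets, labels)
    )
    return correct / len(triplets)


# Toy 2-dimensional "story embeddings", made up purely for illustration.
toy_triplets: List[Triplet] = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # candidate A points the same way as the anchor
    ([0.0, 1.0], [1.0, 0.0], [0.1, 0.9]),  # candidate B points the same way as the anchor
]
toy_labels = ["a", "b"]
toy_score = triplet_accuracy(toy_triplets, toy_labels)
```

Under this sketch, the same labeled triples serve both tracks: a classifier predicts the label directly, while an embedding model is judged by whether the distances it induces reproduce the human choice.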