🤖 AI Summary
This study addresses the challenge that large language models (LLMs) struggle to effectively integrate global information when processing long-form texts, particularly exhibiting divergent narrative focus compared to humans in novel summarization. The authors present the first systematic comparison between human-written summaries and those generated by nine state-of-the-art LLMs. By aligning summaries to source chapters at the sentence level and analyzing model attention patterns to assess conceptual engagement, they reveal a pronounced model bias toward text endings and a significant discrepancy in narrative emphasis relative to human judgments. The work contributes a novel alignment-based analytical framework, uncovers this critical positional bias, and introduces the first alignable dataset of chapter–summary pairs for novels, aiming to advance research in long-context comprehension.
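To make the reported positional bias concrete: once each summary sentence is aligned to a chapter, one can map chapter indices to normalized positions in [0, 1] and compare where human versus model summaries concentrate. This is a hypothetical illustration of one way such a bias could be quantified, not the paper's exact procedure; the function names are my own.

```python
def normalized_positions(alignment, num_chapters):
    """Map aligned chapter indices to [0, 1]; higher values = later in the novel.

    `alignment` is a list of chapter indices, one per summary sentence
    (hypothetical representation of the paper's sentence-chapter alignment).
    """
    if num_chapters < 2:
        return [0.0 for _ in alignment]
    return [i / (num_chapters - 1) for i in alignment]


def mean_position(alignment, num_chapters):
    """Average normalized position of a summary's aligned sentences.

    A value well above 0.5 would indicate the end-of-text bias the study
    reports for model-written summaries.
    """
    positions = normalized_positions(alignment, num_chapters)
    return sum(positions) / len(positions) if positions else 0.0
```

For example, a summary whose four sentences align to chapters 0, 9, 9, and 9 of a ten-chapter novel has a mean position of 0.75, i.e. heavily end-weighted.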
📝 Abstract
Although LLM context lengths have grown, there is evidence that their ability to integrate information across long-form texts has not kept pace. We evaluate one such understanding task: generating summaries of novels. When human authors of summaries compress a story, they reveal what they consider narratively important. Therefore, by comparing human and LLM-authored summaries, we can assess whether models mirror human patterns of conceptual engagement with texts. To measure conceptual engagement, we align sentences from 150 human-written novel summaries with the specific chapters they reference. We demonstrate the difficulty of this alignment, which itself reflects the complexity of summarization. We then generate and align additional summaries by nine state-of-the-art LLMs for each of the 150 reference texts. Comparing the human and model-authored summaries, we find both stylistic differences between the texts and differences in how humans and LLMs distribute their focus throughout a narrative, with models emphasizing the ends of texts. Comparing human narrative engagement with model attention mechanisms suggests explanations for degraded narrative comprehension and targets for future development. We release our dataset to support future research.
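The sentence-level alignment the abstract describes could, in its simplest form, be sketched as a lexical-overlap baseline: represent each summary sentence and each chapter as a bag of words and assign the sentence to the most similar chapter. This is a hypothetical sketch under that assumption; the paper's actual alignment method is likely more sophisticated (e.g. semantic embeddings), and the function names here are my own.

```python
from collections import Counter
import math


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words count vectors.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align_summary_to_chapters(summary_sentences, chapters):
    """Assign each summary sentence to the chapter it most resembles.

    Returns a list of chapter indices, one per summary sentence
    (a hypothetical lexical-overlap baseline, not the paper's method).
    """
    chapter_vecs = [Counter(ch.lower().split()) for ch in chapters]
    alignment = []
    for sentence in summary_sentences:
        vec = Counter(sentence.lower().split())
        scores = [cosine(vec, cv) for cv in chapter_vecs]
        alignment.append(max(range(len(chapters)), key=scores.__getitem__))
    return alignment
```

The abstract's point that this alignment is hard in practice follows directly: summary sentences often paraphrase, condense several chapters at once, or share vocabulary with many chapters, so a simple baseline like this would misalign frequently.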