🤖 AI Summary
This work addresses a limitation of existing Retrieval-Augmented Generation (RAG) systems on literary texts: segmentation strategies that disregard complex narrative structure produce fragmented plots and unclear references, which degrades retrieval and generation quality. To overcome this, the authors propose LitSeg, a segmentation framework guided by narrative theory. LitSeg uses multi-stage large language model prompting to extract narrative events, untangle narrative threads, clarify narrative structure, and locate turning points, which together inform structure-aware chunking. The authors also introduce LitSeg-Lite, a lightweight single-pass chunker fine-tuned on LitSeg-generated data with a two-stage training strategy, which distills the multi-stage process into one inference pass. Experiments show that both methods significantly improve retrieval accuracy, context relevance, and downstream question-answering performance over baselines, and ablation studies support the value of narratological guidance and data distillation.
📝 Abstract
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge, particularly for long-tail domains such as literary works. However, the critical step of document segmentation in RAG remains largely underexplored. Existing strategies are typically semantically blind and overlook the complicated narrative structures of literary works, often resulting in fragmented plots and unclear references that severely hinder retrieval and generation performance. To address this, we propose LitSeg, a novel narrative-theory-guided segmentation framework. By employing multi-stage prompting, LitSeg explicitly extracts valid events, untangles narrative threads, clarifies narrative structures, and locates turning points to inform segmentation. To alleviate the computational overhead of multi-stage inference with large-scale models, we further introduce LitSeg-Lite, a lightweight single-pass chunker fine-tuned on LitSeg-generated data via a two-stage training strategy, distilling the complex process into a single inference pass. Extensive experiments demonstrate that with structurally independent text chunks, our methods significantly improve retrieval accuracy and context relevance over baselines, ultimately enhancing downstream QA performance, while ablation studies validate the efficacy of narratological guidance and data distillation.
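The multi-stage flow described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the `segment` and `chunk_at_turning_points` names, the paragraph-level granularity, and the comma-separated index reply format are all hypothetical, and `llm` stands for any caller-supplied function that maps a prompt string to a reply string.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: any function that maps a prompt to a text reply.
LLM = Callable[[str], str]


def chunk_at_turning_points(
    paragraphs: Sequence[str], turning_points: Sequence[int]
) -> List[str]:
    """Join paragraphs into chunks, starting a new chunk at each turning point."""
    # Keep only indices that fall strictly inside the document, deduplicated.
    boundaries = sorted({i for i in turning_points if 0 < i < len(paragraphs)})
    chunks: List[str] = []
    start = 0
    for end in boundaries + [len(paragraphs)]:
        if end > start:
            chunks.append("\n".join(paragraphs[start:end]))
        start = end
    return chunks


def segment(paragraphs: Sequence[str], llm: LLM) -> List[str]:
    """Assumed four-stage prompting chain; the prompts are illustrative only."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(paragraphs))
    # Stage 1: extract narrative events.
    events = llm("List the narrative events in this text:\n" + numbered)
    # Stage 2: untangle narrative threads.
    threads = llm("Group these events into narrative threads:\n" + events)
    # Stage 3: clarify the narrative structure.
    structure = llm("Describe the narrative structure of these threads:\n" + threads)
    # Stage 4: locate turning points as paragraph indices.
    reply = llm(
        "Return the paragraph indices of turning points, comma-separated:\n"
        + structure + "\n" + numbered
    )
    turning_points = [int(t) for t in reply.replace(",", " ").split() if t.isdigit()]
    return chunk_at_turning_points(paragraphs, turning_points)
```

Each stage feeds its reply into the next prompt, so the final boundary decision is conditioned on the extracted events, threads, and structure rather than on surface features alone. A single-pass variant in the spirit of LitSeg-Lite would replace the four calls with one model call that returns the boundaries directly.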