🤖 AI Summary
This study addresses the lack of systematic evaluation and inconsistent benchmarks in existing document chunking strategies for dense retrieval. The authors propose the first two-dimensional taxonomy that encompasses structural, semantic-aware, and large language model (LLM)-guided chunking approaches, along with embedding timing considerations. They establish a unified reproducible framework to comprehensively evaluate diverse strategies—including fixed-length, paragraph-level, LumberChunker, and Late Chunking—across both within-document and corpus-level retrieval tasks. Their findings reveal that structural chunking outperforms LLM-based methods in corpus retrieval, while LumberChunker achieves the best performance in within-document retrieval. Notably, contextualized chunking improves corpus retrieval effectiveness but degrades within-document performance, highlighting a task-dependent trade-off that informs optimal chunking selection.
📝 Abstract
Document chunking is a critical preprocessing step in dense retrieval systems, yet the design space of chunking strategies remains poorly understood. Recent research has proposed several concurrent approaches, including LLM-guided methods (e.g., DenseX and LumberChunker) and contextualized strategies (e.g., Late Chunking), which generate embeddings before segmentation to preserve contextual information. However, these methods emerged independently and were evaluated on benchmarks with minimal overlap, making direct comparisons difficult.
This paper reproduces prior studies in document chunking and presents a systematic framework that unifies existing strategies along two key dimensions: (1) segmentation methods, including structure-based methods (fixed-size, sentence-based, and paragraph-based) as well as semantically-informed and LLM-guided methods; and (2) embedding paradigms, which determine the timing of chunking relative to embedding (pre-embedding chunking vs. contextualized chunking). Our reproduction evaluates these approaches in two distinct retrieval settings established in previous work: in-document retrieval (needle-in-a-haystack) and in-corpus retrieval (the standard information retrieval task).
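The two embedding paradigms above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `embed_tokens` is a hypothetical stand-in for a real embedding model (here random vectors blended with the document mean to mimic context), and fixed-size windows stand in for the segmentation step. Pre-embedding chunking embeds each chunk in isolation; contextualized (late) chunking embeds the full document once and then pools token vectors per chunk span.

```python
import numpy as np

def embed_tokens(tokens: list[str]) -> np.ndarray:
    """Hypothetical stand-in for a contextual embedding model.

    Each token gets a vector, crudely "contextualized" by blending
    with the mean vector of the sequence it was embedded with.
    """
    rng = np.random.default_rng(0)
    base = np.stack([rng.normal(size=8) for _ in tokens])
    return 0.7 * base + 0.3 * base.mean(axis=0)

def fixed_size_chunks(tokens: list[str], size: int) -> list[tuple[int, int]]:
    """Structure-based segmentation: fixed-length token windows."""
    return [(i, min(i + size, len(tokens))) for i in range(0, len(tokens), size)]

def pre_embedding_chunking(tokens: list[str], size: int) -> list[np.ndarray]:
    """Chunk first, then embed each chunk in isolation (no cross-chunk context)."""
    return [embed_tokens(tokens[s:e]).mean(axis=0)
            for s, e in fixed_size_chunks(tokens, size)]

def late_chunking(tokens: list[str], size: int) -> list[np.ndarray]:
    """Embed the whole document once, then pool token vectors per chunk span."""
    doc_vecs = embed_tokens(tokens)  # every token sees full-document context
    return [doc_vecs[s:e].mean(axis=0)
            for s, e in fixed_size_chunks(tokens, size)]

tokens = "dense retrieval splits documents into chunks before indexing".split()
pre = pre_embedding_chunking(tokens, size=3)
late = late_chunking(tokens, size=3)
```

The only difference between the two functions is when `embed_tokens` runs relative to segmentation; in a real system the stand-in would be a long-context embedding model, and the pooled late-chunking vectors carry document-level context that isolated chunk embeddings lack.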
Our comprehensive evaluation reveals that optimal chunking strategies are task-dependent: simple structure-based methods outperform LLM-guided alternatives for in-corpus retrieval, while LumberChunker performs best for in-document retrieval. Contextualized chunking improves in-corpus effectiveness but degrades in-document retrieval. We also find that chunk size correlates moderately with in-document effectiveness but only weakly with in-corpus effectiveness, suggesting that differences between segmentation methods are not driven purely by chunk size. Our code and evaluation benchmarks are publicly available at (anonymized).