🤖 AI Summary
To address the degradation in retrieval performance caused by contextual information loss in conventional text chunking for embedding, this paper proposes *late chunking*: the long text is first passed through the Transformer in full, and chunking is applied to the contextualized token representations just before mean pooling, so each chunk embedding is aggregated from context-aware token vectors. Because chunking is delayed until after global context has been modeled, long-range dependencies are preserved inherently, enabling plug-and-play adaptation to a wide range of long-context embedding models. A lightweight fine-tuning strategy is additionally introduced to further sharpen chunk-level discriminability. Evaluation across multiple retrieval benchmarks shows that late chunking consistently outperforms conventional early-chunking baselines in recall and relevance metrics. Crucially, the base method works out of the box, requiring no training or architectural modification, while the optional fine-tuning delivers further gains.
📝 Abstract
Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be over-compressed in the embeddings. Consequently, practitioners often split text documents into smaller chunks and encode them separately. However, chunk embeddings created in this way can lose contextual information from surrounding chunks, resulting in sub-optimal representations. In this paper, we introduce a novel method called *late chunking*, which leverages long-context embedding models to first embed all tokens of the long text, with chunking applied after the transformer model and just before mean pooling, hence the term *late* in its naming. The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks. The method is generic enough to be applied to a wide range of long-context embedding models and works without additional training. To further increase the effectiveness of late chunking, we propose a dedicated fine-tuning approach for embedding models.
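The core operation described above, mean-pooling contextualized token embeddings within each chunk span rather than encoding chunks independently, can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function name, the toy embeddings, and the span format are all hypothetical, and in practice `token_embeddings` would come from the final layer of a long-context encoder applied to the entire document.

```python
import numpy as np

def late_chunking(token_embeddings, chunk_spans):
    """Pool per-chunk embeddings from whole-document token vectors.

    token_embeddings: (num_tokens, dim) array of final-layer token
        representations for the FULL document, so every token vector
        already reflects global context (the key point of the method).
    chunk_spans: list of (start, end) token-index pairs, end exclusive.
    Returns one mean-pooled embedding per chunk.
    """
    return [token_embeddings[start:end].mean(axis=0)
            for start, end in chunk_spans]

# Toy example: 6 "tokens" of dimension 2, split into two chunks.
tokens = np.arange(12, dtype=float).reshape(6, 2)
chunk_embs = late_chunking(tokens, [(0, 3), (3, 6)])
# First chunk pools rows 0-2, second pools rows 3-5.
```

The contrast with conventional (early) chunking is that the latter would run the encoder separately on each chunk's text, so tokens near chunk boundaries never attend to the rest of the document; here the split happens only at the pooling step.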