🤖 AI Summary
To address the degradation in retrieval performance caused by contextual information loss in conventional text chunking for embedding, this paper proposes *late chunking*: the long text is first passed through the Transformer in full, and chunking is applied to the contextualized token representations just before mean pooling, so each chunk embedding is aggregated from context-aware token vectors. Because chunking is delayed until after global context has been modeled, long-range dependencies are preserved inherently, enabling plug-and-play adaptation to a wide range of long-context embedding models. A lightweight fine-tuning strategy is additionally introduced to further sharpen chunk-level discriminability. Evaluation across multiple retrieval benchmarks shows that late chunking consistently outperforms conventional early-chunking baselines in recall and relevance metrics. Crucially, the base method works out of the box, requiring no training or architectural modification, while the optional fine-tuning delivers further gains.
📝 Abstract
Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be over-compressed in the embeddings. Consequently, practitioners often split text documents into smaller chunks and encode them separately. However, chunk embeddings created in this way can lose contextual information from surrounding chunks, resulting in sub-optimal representations. In this paper, we introduce a novel method called *late chunking*, which leverages long-context embedding models to first embed all tokens of the long text, with chunking applied after the transformer model and just before mean pooling, hence the term *late* in its naming. The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks. The method is generic enough to be applied to a wide range of long-context embedding models and works without additional training. To further increase the effectiveness of late chunking, we propose a dedicated fine-tuning approach for embedding models.
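The core operation described above, mean-pooling contextualized token embeddings within each chunk span rather than encoding chunks independently, can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function name, the toy embeddings, and the span format are all hypothetical, and in practice `token_embeddings` would come from the final layer of a long-context encoder applied to the entire document.

```python
import numpy as np

def late_chunking(token_embeddings, chunk_spans):
    """Pool per-chunk embeddings from whole-document token vectors.

    token_embeddings: (num_tokens, dim) array of final-layer token
        representations for the FULL document, so every token vector
        already reflects global context (the key point of the method).
    chunk_spans: list of (start, end) token-index pairs, end exclusive.
    Returns one mean-pooled embedding per chunk.
    """
    return [token_embeddings[start:end].mean(axis=0)
            for start, end in chunk_spans]

# Toy example: 6 "tokens" of dimension 2, split into two chunks.
tokens = np.arange(12, dtype=float).reshape(6, 2)
chunk_embs = late_chunking(tokens, [(0, 3), (3, 6)])
# First chunk pools rows 0-2, second pools rows 3-5.
```

The contrast with conventional (early) chunking is that the latter would run the encoder separately on each chunk's text, so tokens near chunk boundaries never attend to the rest of the document; here the split happens only at the pooling step.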