SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension

📅 2025-08-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the semantic fragmentation caused by chunking, and the inability of existing embedding models to effectively encode long documents in long-document RAG, this paper proposes SitEmb, a contextualized (situated) embedding model. SitEmb introduces a conditional encoding mechanism and a context-aware training paradigm that jointly model short text chunks within their broader document context, preserving both local evidence interpretability and global semantic coherence. Built on the BGE-M3 architecture, SitEmb supports multi-scale dense retrieval. On a curated book-plot retrieval benchmark, it substantially outperforms state-of-the-art models with an order of magnitude more parameters: the 1B-parameter SitEmb-v1 surpasses 7B-8B models, while the 8B-parameter SitEmb-v1.5 achieves over 10% further improvement. SitEmb also demonstrates strong cross-lingual transfer and robust generalization across downstream tasks.

๐Ÿ“ Abstract
Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth. We propose an alternative approach to this challenge: representing short chunks conditioned on a broader context window to enhance retrieval performance -- i.e., situating a chunk's meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3, with only 1B parameters, substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.
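The "situated" idea in the abstract can be made concrete with a small, self-contained sketch: embed a short chunk conditioned on its surrounding document context, then retrieve against the situated vector instead of the chunk-only vector. This is an illustration only; the bag-of-words counter stands in for a trained dense encoder, and the `alpha` context-mixing stands in for SitEmb's conditional encoding mechanism, both of which are assumptions rather than the paper's actual method.

```python
# Toy sketch of situated chunk retrieval (illustration, not SitEmb itself).
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': lowercase word counts (proxy for a dense encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[t] for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def situated_embed(chunk, context, alpha=0.5):
    """Represent a short chunk conditioned on its broader document context."""
    vec = Counter(embed(chunk))
    for tok, cnt in embed(context).items():
        vec[tok] += alpha * cnt  # context contributes at reduced weight
    return vec

chunks = [
    "He finally opened the letter.",  # ambiguous on its own
    "The letter revealed his father's will and the lost estate.",
]
context = " ".join(chunks)

query = embed("inheritance of the estate")
plain = cosine(query, embed(chunks[0]))
situated = cosine(query, situated_embed(chunks[0], context))
print(f"plain={plain:.3f}  situated={situated:.3f}")
```

In this toy example the situated score for the first chunk is higher than the chunk-only score, because the surrounding context supplies "estate", mirroring the abstract's point that a chunk's meaning often depends on the rest of the document while the retrieval unit stays small.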
Problem

Research questions and friction points this paper is trying to address.

Enhancing retrieval performance for short text chunks with broader context
Addressing limitations of embedding models in encoding situated context
Improving semantic association and long document comprehension in RAG
Innovation

Methods, ideas, or system contributions that make the work stand out.

Context-aware dense retrieval for semantic association
Situated embedding models (SitEmb) training paradigm
Improved performance with 8B SitEmb-v1.5 model
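The "training paradigm" bullet above is not detailed on this page; dense retrievers in this family are typically trained with a contrastive (InfoNCE) objective that ranks a query's positive chunk above negatives. A minimal stdlib sketch of that standard loss follows; that SitEmb uses exactly this form is an assumption, not a claim from the summary.

```python
# Standard contrastive (InfoNCE) loss sketch -- a common dense-retriever
# objective, assumed here for illustration; not confirmed as SitEmb's loss.
import math

def info_nce(sim_pos, sim_negs, tau=0.05):
    """-log softmax of the positive's similarity among all candidates.

    sim_pos: query/positive-chunk similarity; sim_negs: similarities to
    negative chunks; tau: temperature. Lower loss = positive ranked higher.
    """
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]

# A better-ranked positive yields a lower loss:
easy = info_nce(0.9, [0.1, 0.2])
hard = info_nce(0.3, [0.1, 0.2])
print(f"easy={easy:.4f}  hard={hard:.4f}")
```

For a context-aware variant, the positive and negative similarities would be computed from situated chunk embeddings rather than chunk-only embeddings, so the model learns to exploit document context during retrieval.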