🤖 AI Summary
This work addresses the challenge of balancing global semantic modeling with computational efficiency in multilingual long-document retrieval by proposing a novel embedding method built on diffusion-based pretrained language models. The approach integrates document-level global context into paragraph representations through a late-chunking strategy and a context-aware bidirectional attention mechanism. Dense vectors are produced by mean pooling and refined through multi-stage contrastive learning. The resulting model, pplx-embed-v1, achieves strong performance across multilingual and code retrieval benchmarks, including MTEB and MIRACL. Its contextual variant, pplx-embed-context-v1, sets a new state of the art on ConTEB and demonstrates both efficiency and practicality in large-scale production environments with tens of millions of documents.
📝 Abstract
In this report, we introduce pplx-embed, a family of multilingual embedding models that employ multi-stage contrastive learning on a diffusion-pretrained language model backbone for web-scale retrieval. By leveraging bidirectional attention through diffusion-based pretraining, our models capture comprehensive bidirectional context within passages, enabling the use of mean pooling and a late chunking strategy to better preserve global context across long documents. We release two model types: pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB(Multilingual, v2), MTEB(Code), MIRACL, BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 sets new records on the ConTEB benchmark. Beyond public benchmarks, pplx-embed-v1 demonstrates strong performance on our internal evaluation suite, which focuses on real-world, large-scale search scenarios over tens of millions of documents. These results validate the models' effectiveness in production environments where retrieval quality and efficiency are critical at scale.
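The core idea behind late chunking with mean pooling can be sketched as follows. This is a minimal illustrative example, not the report's implementation: the function name `late_chunk_embeddings` and the toy token vectors are assumptions; the point is only that the full document is encoded once with bidirectional attention, and each chunk vector is then mean-pooled from the already-contextualized token embeddings, so every chunk reflects document-level context.

```python
import numpy as np

def late_chunk_embeddings(token_embeddings, chunk_spans):
    """Late chunking (illustrative sketch): given contextualized token
    embeddings for an entire document, produce one vector per chunk by
    mean pooling over each chunk's token span. Because the tokens were
    encoded jointly, each pooled vector carries global document context.

    token_embeddings: array of shape (num_tokens, dim)
    chunk_spans: list of (start, end) token index pairs, end exclusive
    """
    chunks = [token_embeddings[start:end].mean(axis=0)
              for start, end in chunk_spans]
    return np.stack(chunks)

# Toy example: 10 "contextualized" token vectors of dimension 4,
# split into two chunks of 5 tokens each.
tokens = np.arange(40, dtype=float).reshape(10, 4)
chunk_vectors = late_chunk_embeddings(tokens, [(0, 5), (5, 10)])
```

Contrast this with naive chunking, which would split the text first and encode each chunk in isolation, losing any cross-chunk context.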