Diffusion-Pretrained Dense and Contextual Embeddings

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of balancing global semantic modeling and computational efficiency in multilingual long-document retrieval by proposing a novel embedding method based on diffusion-based pretrained language models. The approach integrates document-level global context into paragraph representations through a late-chunking strategy and a context-aware bidirectional attention mechanism. High-quality dense vectors are further refined via multi-stage contrastive learning and mean pooling. The resulting model, pplx-embed-v1, achieves strong performance across multilingual and code retrieval benchmarks, including MTEB and MIRACL. Its contextual variant, pplx-embed-context-v1, sets a new state-of-the-art on ConTEB and demonstrates both efficiency and practicality in large-scale production environments with tens of millions of documents.

📝 Abstract
In this report, we introduce pplx-embed, a family of multilingual embedding models that employ multi-stage contrastive learning on a diffusion-pretrained language model backbone for web-scale retrieval. By leveraging bidirectional attention through diffusion-based pretraining, our models capture comprehensive bidirectional context within passages, enabling the use of mean pooling and a late chunking strategy to better preserve global context across long documents. We release two model types: pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB (Multilingual, v2), MTEB (Code), MIRACL, BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 sets new records on the ConTEB benchmark. Beyond public benchmarks, pplx-embed-v1 demonstrates strong performance on our internal evaluation suite, which focuses on real-world, large-scale search scenarios over tens of millions of documents. These results validate the models' effectiveness in production environments where retrieval quality and efficiency are critical at scale.
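The abstract's core mechanism can be illustrated in miniature: with late chunking, the whole document is encoded in one pass (so every token embedding already reflects document-level bidirectional context), and passage vectors are mean-pooled from those contextualized token embeddings afterwards, rather than encoding each chunk in isolation. The sketch below is a hypothetical illustration of that pooling step only; the function names, the toy vectors, and the chunk spans are assumptions, not part of the pplx-embed release, and the model forward pass producing the token embeddings is elided.

```python
# Hypothetical sketch of late chunking with mean pooling, as described in the
# abstract. All names and data here are illustrative; in the real pipeline the
# token embeddings would come from a single encoder pass over the full document.

def mean_pool(vectors):
    """Average a list of equal-length token embeddings into one vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def late_chunk(token_embeddings, chunk_spans):
    """Pool contextualized token embeddings per chunk AFTER full-document encoding.

    token_embeddings: per-token vectors from one forward pass over the whole
                      document, so each token already "sees" global context.
    chunk_spans:      (start, end) token index pairs, one per passage.
    """
    return [mean_pool(token_embeddings[s:e]) for s, e in chunk_spans]

# Toy example: six 2-d token embeddings split into two passages.
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [2.0, 2.0], [4.0, 4.0]]
chunks = late_chunk(tokens, [(0, 2), (2, 6)])
# chunks[0] == [2.0, 0.0]; chunks[1] == [1.5, 3.0]
```

The contrast with naive chunking is that here each passage vector is pooled from tokens that attended to the entire document, which is what lets the resulting embeddings preserve global context.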
Problem

Research questions and friction points this paper is trying to address.

dense embedding
long-document retrieval
multilingual retrieval
contextual embedding
web-scale retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion-pretrained
multi-stage contrastive learning
bidirectional context
late chunking
contextualized embeddings
Sedigheh Eslami — Perplexity AI
Maksim Gaiduk — Perplexity AI
Markus Krimmel — PhD Student, Max Planck Institute of Biochemistry (graph generation, geometric deep learning)
Louis Milliken — Perplexity AI
Bo Wang — Member of Technical Staff at Perplexity AI (information retrieval)
Denis Bykov — Perplexity AI