🤖 AI Summary
Existing contrastive learning paradigms (e.g., InfoNCE) rely on binary relevance labels, treating all unlabeled documents as equally negative; they thus ignore fine-grained relevance distinctions and remain vulnerable to annotation noise. To address this, we propose a fully synthetic multi-level ranking training framework: instead of human-annotated documents, open-source large language models generate synthetic documents that answer real user queries at several explicit relevance grades. We further introduce a list-wise Wasserstein loss that directly models distributional distances between ranked document sequences, aligning the training objective with the ordinal nature of ranking. This is the first multi-level relevance training method that requires no real document annotations. Experiments demonstrate that our approach significantly outperforms InfoNCE training across multiple IR benchmarks, matches the effectiveness of fully supervised training on real labeled documents, is more robust to distribution shift, and achieves superior zero-shot transfer performance on BEIR.
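For context, the binary-relevance objective this framework replaces is the standard InfoNCE loss. For a query $q$ with one annotated positive document $d^+$, negatives $d^-_1, \dots, d^-_n$, similarity function $s(\cdot,\cdot)$, and temperature $\tau$:

$$
\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\big(s(q, d^+)/\tau\big)}{\exp\big(s(q, d^+)/\tau\big) + \sum_{i=1}^{n} \exp\big(s(q, d^-_i)/\tau\big)}
$$

Every $d^-_i$ contributes identically to the denominator, so the loss cannot distinguish a nearly relevant document from an entirely off-topic one.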
📝 Abstract
Recent advances in large language models (LLMs) have enabled the augmentation of information retrieval (IR) pipelines with synthetic data in various ways. Yet the main training paradigm remains contrastive learning with binary relevance labels and the InfoNCE loss, where one positive document is contrasted against one or more negatives. This objective places all documents that are not explicitly annotated as relevant on an equally negative footing, regardless of their actual degree of relevance, thus (a) missing subtle nuances that are useful for ranking and (b) being susceptible to annotation noise. To overcome this limitation, in this work we forgo real training documents and annotations altogether and instead use open-source LLMs to directly generate synthetic documents that answer real user queries at several different levels of relevance. This fully synthetic ranking context of graduated relevance, together with an appropriate list-wise loss (the Wasserstein distance), enables us to train dense retrievers in a way that better captures the ranking task. Experiments on various IR datasets show that our proposed approach outperforms conventional training with InfoNCE by a large margin. Without using any real documents for training, our dense retriever significantly outperforms the same retriever trained through self-supervision. More importantly, it matches the performance of the same retriever trained on real, labeled training documents from the same dataset, while being more robust to distribution shift and clearly outperforming that retriever when evaluated zero-shot on the BEIR dataset collection.
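To make the list-wise objective concrete, below is a minimal PyTorch sketch of a Wasserstein-style loss over graded relevance labels. It is illustrative only: the function name, the softmax normalization of scores, and the closed-form 1-D Wasserstein-1 distance (the L1 distance between cumulative distributions over list positions) are our assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def listwise_wasserstein_loss(scores: torch.Tensor, grades: torch.Tensor,
                              eps: float = 1e-8) -> torch.Tensor:
    """Illustrative 1-D Wasserstein-1 loss between the model's score
    distribution over a candidate list and the target distribution implied
    by graded relevance labels (hypothetical sketch, not the paper's code).

    scores: (batch, n_docs) raw query-document similarity scores
    grades: (batch, n_docs) non-negative relevance grades (e.g., 0..3);
            candidates are assumed listed in a fixed order so that list
            positions form an ordered support for the two distributions
    """
    pred = F.softmax(scores, dim=-1)                            # predicted distribution
    target = grades / (grades.sum(dim=-1, keepdim=True) + eps)  # target distribution
    # On a shared 1-D ordered support, the Wasserstein-1 distance reduces
    # to the L1 distance between the cumulative distribution functions.
    cdf_pred = torch.cumsum(pred, dim=-1)
    cdf_target = torch.cumsum(target, dim=-1)
    return (cdf_pred - cdf_target).abs().sum(dim=-1).mean()

# Toy usage: 2 queries, 4 generated candidates each, grades on a 0-3 scale.
scores = torch.randn(2, 4, requires_grad=True)
grades = torch.tensor([[3., 2., 1., 0.], [0., 1., 3., 2.]])
loss = listwise_wasserstein_loss(scores, grades)
loss.backward()  # gradients flow back to the retriever's scores
```

Unlike InfoNCE, the gradient here depends on how far each document's predicted probability mass deviates from its graded target, so a grade-2 document is penalized differently from a grade-0 one.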