BiXSE: Improving Dense Retrieval via Probabilistic Graded Relevance Distillation

📅 2025-08-08

📈 Citations: 0

✨ Influential: 0

career value

161K/year

🤖 AI Summary

Existing dense retrieval models predominantly rely on binary relevance labels, failing to capture the continuous, fine-grained relevance inherent in real-world scenarios. Method: We propose BiXSE, the first framework to directly model LLM-generated graded relevance scores as soft probabilistic supervision signals for end-to-end optimization of sentence embeddings. It adopts a pointwise training paradigm with binary cross-entropy loss and in-batch negatives, enabling highly informative supervision from a single query–document pair and drastically reducing annotation and computational overhead. Contribution/Results: BiXSE consistently outperforms the InfoNCE baseline across MMTEB, BEIR, and TREC-DL benchmarks, matching or exceeding the performance of strong pairwise ranking models. This demonstrates both the effectiveness and practicality of fine-grained LLM supervision for dense retrieval.

Technology Category

Application Category

📝 Abstract

Neural sentence embedding models for dense retrieval typically rely on binary relevance labels, treating query-document pairs as either relevant or irrelevant. However, real-world relevance often exists on a continuum, and recent advances in large language models (LLMs) have made it feasible to scale the generation of fine-grained graded relevance labels. In this work, we propose BiXSE, a simple and effective pointwise training method that optimizes binary cross-entropy (BCE) over LLM-generated graded relevance scores. BiXSE interprets these scores as probabilistic targets, enabling granular supervision from a single labeled query-document pair per query. Unlike pairwise or listwise losses that require multiple annotated comparisons per query, BiXSE achieves strong performance with reduced annotation and compute costs by leveraging in-batch negatives. Extensive experiments across sentence embedding (MMTEB) and retrieval benchmarks (BEIR, TREC-DL) show that BiXSE consistently outperforms softmax-based contrastive learning (InfoNCE), and matches or exceeds strong pairwise ranking baselines when trained on LLM-supervised data. BiXSE offers a robust, scalable alternative for training dense retrieval models as graded relevance supervision becomes increasingly accessible.

Problem

Research questions and friction points this paper is trying to address.

Improving dense retrieval with probabilistic graded relevance labels

Reducing annotation costs via single query-document pair supervision

Enhancing performance over binary relevance and contrastive learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses probabilistic graded relevance distillation

Optimizes binary cross-entropy with LLM-generated scores

Leverages in-batch negatives to reduce costs

🔎 Similar Papers

No similar papers found.