BiXSE: Improving Dense Retrieval via Probabilistic Graded Relevance Distillation

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing dense retrieval models predominantly rely on binary relevance labels, failing to capture the continuous, fine-grained relevance inherent in real-world scenarios. Method: We propose BiXSE, the first framework to directly model LLM-generated graded relevance scores as soft probabilistic supervision signals for end-to-end optimization of sentence embeddings. It adopts a pointwise training paradigm with binary cross-entropy loss and in-batch negatives, enabling highly informative supervision from a single query–document pair and drastically reducing annotation and computational overhead. Contribution/Results: BiXSE consistently outperforms the InfoNCE baseline across MMTEB, BEIR, and TREC-DL benchmarks, matching or exceeding the performance of strong pairwise ranking models. This demonstrates both the effectiveness and practicality of fine-grained LLM supervision for dense retrieval.

📝 Abstract
Neural sentence embedding models for dense retrieval typically rely on binary relevance labels, treating query-document pairs as either relevant or irrelevant. However, real-world relevance often exists on a continuum, and recent advances in large language models (LLMs) have made it feasible to scale the generation of fine-grained graded relevance labels. In this work, we propose BiXSE, a simple and effective pointwise training method that optimizes binary cross-entropy (BCE) over LLM-generated graded relevance scores. BiXSE interprets these scores as probabilistic targets, enabling granular supervision from a single labeled query-document pair per query. Unlike pairwise or listwise losses that require multiple annotated comparisons per query, BiXSE achieves strong performance with reduced annotation and compute costs by leveraging in-batch negatives. Extensive experiments across sentence embedding (MMTEB) and retrieval benchmarks (BEIR, TREC-DL) show that BiXSE consistently outperforms softmax-based contrastive learning (InfoNCE), and matches or exceeds strong pairwise ranking baselines when trained on LLM-supervised data. BiXSE offers a robust, scalable alternative for training dense retrieval models as graded relevance supervision becomes increasingly accessible.
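The pointwise objective described in the abstract can be sketched as follows: within a batch, each query's paired document receives its LLM-generated graded score (normalized to [0, 1]) as a probabilistic target, while all other in-batch documents serve as negatives with target 0, and binary cross-entropy is applied independently to every query–document logit. This is a minimal illustration, not the authors' implementation; the logit scale and the exact handling of negatives are assumptions.

```python
import numpy as np

def bixse_loss(query_emb, doc_emb, graded_relevance, scale=10.0):
    """Pointwise BCE over graded relevance targets with in-batch negatives.

    A minimal sketch of the objective as described in the abstract.
    `scale` (a logit temperature) is an assumption for illustration.
    """
    # Cosine similarities between every query and every document in the batch.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = scale * (q @ d.T)  # shape (B, B)

    # Targets: the paired (diagonal) document gets its graded score in [0, 1];
    # every other in-batch document is treated as a negative (target 0).
    targets = np.diag(np.asarray(graded_relevance, dtype=float))

    # Elementwise binary cross-entropy against sigmoid probabilities.
    probs = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-9
    bce = -(targets * np.log(probs + eps) + (1 - targets) * np.log(1 - probs + eps))
    return bce.mean()
```

Because each entry of the logit matrix is scored independently, a single labeled query–document pair per query suffices; no softmax over candidates or multi-way comparison is required.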
Problem

Research questions and friction points this paper is trying to address.

Improving dense retrieval with probabilistic graded relevance labels
Reducing annotation costs via single query-document pair supervision
Enhancing performance over binary relevance and contrastive learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses probabilistic graded relevance distillation
Optimizes binary cross-entropy with LLM-generated scores
Leverages in-batch negatives to reduce costs
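For contrast, the InfoNCE baseline that BiXSE is compared against normalizes scores across the batch with a softmax, so its target is a one-hot distribution over candidates and only binary relevance fits naturally. A minimal sketch (the logit scale is an assumption):

```python
import numpy as np

def infonce_loss(query_emb, doc_emb, scale=10.0):
    """Softmax-based contrastive (InfoNCE) baseline.

    Each query's paired document is the positive; all other in-batch
    documents are negatives. The target is one-hot, so graded relevance
    cannot be expressed directly. `scale` is an illustrative assumption.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = scale * (q @ d.T)  # shape (B, B)
    # Cross-entropy against the diagonal (the true query-document pairing).
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

The structural difference is the normalization: InfoNCE couples all candidates through the softmax denominator, whereas BiXSE's pointwise BCE scores each pair independently, which is what lets a graded score in [0, 1] act directly as the supervision target.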