Mitigating Preference Leakage via Strict Estimator Separation for Normative Generative Ranking

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the issue of preference leakage in generative information retrieval, where overlapping supervision and evaluation models compromise assessment reliability—particularly in normative ranking tasks such as cultural relevance. The study formally characterizes this problem for the first time and introduces a leakage-free dual-judge framework that strictly separates the supervisory model (Judge B) from the evaluation model (Judge A). It further uses knowledge distillation to transfer fine-grained cultural preferences from a Cross-Encoder into the efficient dense encoder BGE-M3. Evaluated on a newly constructed NGR-33k cultural stories benchmark and the Moral Stories dataset, the distilled BGE-M3 significantly outperforms the original Cross-Encoder under leakage-free evaluation and demonstrates strong alignment with human normative judgments.

📝 Abstract
In Generative Information Retrieval (GenIR), the bottleneck has shifted from generation to the selection of candidates, particularly for normative criteria such as cultural relevance. Current LLM-as-a-Judge evaluations often suffer from circularity and preference leakage, where overlapping supervision and evaluation models inflate performance. We address this by formalising cultural relevance as a within-query ranking task and introducing a leakage-free two-judge framework that strictly separates supervision (Judge B) from evaluation (Judge A). On a new benchmark of 33,052 culturally grounded stories (NGR-33k), we find that while classical baselines yield only modest gains, a dense bi-encoder distilled from a Judge-B-supervised Cross-Encoder is highly effective. Although the Cross-Encoder provides a strong supervision signal for distillation, the distilled BGE-M3 model substantially outperforms it under leakage-free Judge A evaluation. We validate our framework on the human-curated Moral Stories dataset, showing strong alignment with human norms. Our results demonstrate that rigorous evaluator separation is a prerequisite for credible GenIR evaluation, proving that subtle cultural preferences can be distilled into efficient rankers without leakage.
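The abstract's core recipe — distilling a cross-encoder teacher's within-query preferences into a dense bi-encoder student — can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: the toy encoder stands in for BGE-M3, the teacher scores are hard-coded in place of a real Judge-B-supervised Cross-Encoder, and the listwise KL-divergence distillation loss is an assumed choice of objective.

```python
# Hedged sketch: listwise knowledge distillation of within-query ranking
# preferences from a cross-encoder teacher into a dense bi-encoder student.
# All model sizes, data, and the loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBiEncoder(nn.Module):
    """Toy stand-in for a dense encoder such as BGE-M3."""
    def __init__(self, vocab=100, dim=16):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab, dim)  # mean-pools token embeddings

    def forward(self, token_ids):
        # L2-normalized embeddings, so query-doc scores are cosine similarities
        return F.normalize(self.emb(token_ids), dim=-1)

def listwise_distill_loss(student_scores, teacher_scores, tau=1.0):
    """KL between teacher and student score distributions over one query's candidates."""
    t = F.log_softmax(teacher_scores / tau, dim=-1)
    s = F.log_softmax(student_scores / tau, dim=-1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean")

torch.manual_seed(0)
student = ToyBiEncoder()
opt = torch.optim.Adam(student.parameters(), lr=1e-2)

# One query with 4 candidate stories (toy token ids); frozen teacher scores
# play the role of the cross-encoder's within-query judgments.
query = torch.randint(0, 100, (1, 5))
cands = torch.randint(0, 100, (4, 5))
teacher_scores = torch.tensor([[2.0, 0.5, -1.0, 0.1]])

losses = []
for _ in range(50):
    q = student(query)        # (1, dim)
    d = student(cands)        # (4, dim)
    student_scores = q @ d.T  # (1, 4) within-query similarities
    loss = listwise_distill_loss(student_scores, teacher_scores)
    opt.zero_grad(); loss.backward(); opt.step()
    losses.append(loss.item())

print(f"distillation loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The same shape scales up in the obvious way: batch many queries, replace the toy encoder with a pretrained bi-encoder, and score each query's candidate pool with the frozen cross-encoder once, offline. Because the student only sees teacher scores (never the evaluation judge), the evaluator-separation property the paper argues for is preserved by construction.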
Problem

Research questions and friction points this paper is trying to address.

Preference Leakage
Generative Information Retrieval
LLM-as-a-Judge
Cultural Relevance
Evaluation Bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Preference Leakage
Strict Estimator Separation
Generative Information Retrieval
Cultural Relevance
Knowledge Distillation