Reproducing and Comparing Distillation Techniques for Cross-Encoders

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of systematic comparison among knowledge distillation strategies and supervision objectives for cross-encoders under a unified experimental setup, a gap that has hindered the identification of robust design choices. The authors systematically evaluate distillation from large language models and from ensembles of cross-encoder teachers within a single consistent framework. The comparison spans the mainstream backbone architectures BERT, RoBERTa, ELECTRA, DeBERTa-v3, and ModernBERT, and contrasts pointwise supervision with relative-comparison objectives: pairwise MarginMSE and listwise InfoNCE. The results show that the relative-comparison objectives consistently outperform pointwise baselines across all backbones and evaluation benchmarks, with gains comparable to those achieved by scaling the backbone, supporting their broad effectiveness and generalizability.

📝 Abstract
Recent advances have established transformer-based cross-encoders as a keystone of Information Retrieval (IR). Recent studies have focused on knowledge distillation and showed that, with the right strategy, traditional cross-encoders can reach the effectiveness of LLM re-rankers. Yet comparisons with earlier training strategies, including distillation from strong cross-encoder teachers, remain unclear. In addition, few studies cover a comparable range of backbone encoders, even though substantial improvements have been made in this area since BERT. This lack of comprehensive studies in controlled environments makes it difficult to identify robust design choices. In this work, we reproduce \citet{schlattRankDistiLLMClosingEffectiveness2025}'s LLM-based distillation strategy and compare it to \citet{hofstatterImprovingEfficientNeural2020}'s approach based on an ensemble of cross-encoder teachers, as well as to other supervised objectives, to fine-tune a large range of cross-encoders, from the original BERT and its follow-ups RoBERTa, ELECTRA, and DeBERTa-v3, to the more recent ModernBERT. We evaluate all models on both in-domain (TREC-DL and MS~MARCO dev) and out-of-domain datasets (BEIR, LoTTE, and Robust04). Our results show that objectives emphasizing relative comparisons -- pairwise MarginMSE and listwise InfoNCE -- consistently outperform pointwise baselines across all backbones and evaluation settings, and that the choice of objective can yield gains comparable to scaling the backbone architecture.
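To make the two relative-comparison objectives concrete, here is a minimal sketch of pairwise MarginMSE and listwise InfoNCE on toy relevance scores. This is an illustration of the standard loss definitions, not the authors' implementation; the function names and example scores are assumptions.

```python
import math

def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Pairwise MarginMSE: regress the student's score margin
    (positive minus negative document) onto the teacher's margin,
    averaged over the batch of (query, doc+, doc-) triples."""
    diffs = [
        ((sp - sn) - (tp - tn)) ** 2
        for sp, sn, tp, tn in zip(student_pos, student_neg,
                                  teacher_pos, teacher_neg)
    ]
    return sum(diffs) / len(diffs)

def info_nce(pos_score, neg_scores):
    """Listwise InfoNCE: negative log-softmax probability of the
    positive document among the full candidate list."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return -(pos_score - log_z)

# Toy usage: a student that reproduces the teacher's margins
# incurs zero MarginMSE loss.
print(margin_mse([2.0], [1.0], [5.0], [4.0]))   # 0.0
# InfoNCE with one indistinguishable negative gives log(2).
print(info_nce(0.0, [0.0]))                     # ~0.6931
```

In practice both losses are computed over batched model logits (e.g. with `torch.nn.functional`), but the scalar versions above capture the distinction the paper draws: MarginMSE supervises score differences between document pairs, while InfoNCE normalizes over a whole candidate list.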
Problem

Research questions and friction points this paper is trying to address.

knowledge distillation
cross-encoders
training objectives
information retrieval
model comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge distillation
cross-encoder
relative ranking objectives
MarginMSE
InfoNCE