Distillation versus Contrastive Learning: How to Train Your Rerankers

📅 2025-07-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically compares knowledge distillation (KD) and contrastive learning (CL) for training text re-ranking cross-encoders. We conduct controlled experiments across diverse student model architectures and sizes, under both in-domain and zero-shot cross-domain settings, using consistent data splits and evaluation protocols; notably, we employ state-of-the-art contrastive models as high-capacity teachers to establish strong distillation pathways. Results show that KD significantly outperforms CL when a powerful teacher is available, with consistent gains across all student variants; conversely, CL remains robust in the absence of a suitable teacher. Crucially, this study is the first to empirically demonstrate—within the re-ranking task—that teacher model capacity is a decisive factor governing KD effectiveness. Our findings provide reproducible, generalizable evidence and concrete guidance for selecting training strategies in practical deployment scenarios.

📝 Abstract
Training text rerankers is crucial for information retrieval. Two primary strategies are widely used: contrastive learning (optimizing directly on ground-truth labels) and knowledge distillation (transferring knowledge from a larger reranker). While both have been studied in the literature, a clear comparison of their effectiveness for training cross-encoder rerankers under practical conditions is needed. This paper empirically compares these strategies by training rerankers of different sizes and architectures using both methods on the same data, with a strong contrastive learning model acting as the distillation teacher. Our results show that knowledge distillation generally yields better in-domain and out-of-domain ranking performance than contrastive learning when distilling from a larger teacher model. This finding is consistent across student model sizes and architectures. However, distilling from a teacher of the same capacity does not provide the same advantage, particularly for out-of-domain tasks. These findings offer practical guidance for choosing a training strategy based on available teacher models. We therefore recommend knowledge distillation for training smaller rerankers when a larger, more powerful teacher is accessible; in its absence, contrastive learning provides a strong and more reliable alternative.
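The two objectives the abstract contrasts can be sketched as per-query loss functions over a student reranker's scores for a list of candidate passages. This is a minimal illustration, not the paper's exact formulation: the InfoNCE-style cross-entropy and the KL-divergence matching below are common instantiations of contrastive learning and score distillation, and all function names are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of relevance scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def contrastive_loss(student_scores, positive_index):
    # Contrastive learning: cross-entropy against hard 0/1 labels,
    # i.e. push the ground-truth positive above the negatives.
    probs = softmax(student_scores)
    return -math.log(probs[positive_index])

def distillation_loss(student_scores, teacher_scores):
    # Knowledge distillation: KL divergence from the teacher's score
    # distribution to the student's, so the student learns the
    # teacher's soft relevance ordering rather than hard labels.
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy scores for one query with three candidate passages.
student = [2.0, 0.5, -1.0]
teacher = [3.0, 1.0, -2.0]
cl = contrastive_loss(student, positive_index=0)
kd = distillation_loss(student, teacher)
```

The soft teacher distribution is what makes teacher capacity matter: a stronger teacher supplies a more accurate relevance ordering for the student to match, whereas a same-capacity teacher adds little beyond the hard labels the contrastive objective already uses.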
Problem

Research questions and friction points this paper is trying to address.

Compare effectiveness of contrastive learning and knowledge distillation for rerankers
Evaluate performance across different model sizes and architectures
Provide guidance on training strategy based on teacher model availability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compares contrastive learning and knowledge distillation
Uses larger model as teacher for distillation
Recommends distillation for smaller rerankers