🤖 AI Summary
This work systematically compares knowledge distillation (KD) and contrastive learning (CL) for training text re-ranking cross-encoders. We conduct controlled experiments across diverse student model architectures and sizes, under both in-domain and zero-shot cross-domain settings, using consistent data splits and evaluation protocols; notably, we employ state-of-the-art contrastive models as high-capacity teachers to establish strong distillation pathways. Results show that KD significantly outperforms CL when a powerful teacher is available, with consistent gains across all student variants; conversely, CL remains robust in the absence of a suitable teacher. Crucially, this study is the first to empirically demonstrate, within the re-ranking task, that teacher model capacity is a decisive factor governing KD effectiveness. Our findings provide reproducible, generalizable evidence and concrete guidance for selecting training strategies in practical deployment scenarios.
📝 Abstract
Training text rerankers is crucial for information retrieval. Two primary strategies are widely used: contrastive learning (optimizing directly on ground-truth labels) and knowledge distillation (transferring knowledge from a larger reranker). While both have been studied in the literature, a clear comparison of their effectiveness for training cross-encoder rerankers under practical conditions is needed.
This paper empirically compares these strategies by training rerankers of different sizes and architectures with both methods on the same data, using a strong contrastive learning model as the distillation teacher. Our results show that knowledge distillation generally yields better in-domain and out-of-domain ranking performance than contrastive learning when distilling from a larger teacher model, and this finding holds across student model sizes and architectures. However, distilling from a teacher of the same capacity does not provide the same advantage, particularly on out-of-domain tasks. These findings offer practical guidance for choosing a training strategy based on the teacher models available: we recommend knowledge distillation for training smaller rerankers when a larger, more powerful teacher is accessible; otherwise, contrastive learning provides a strong and more reliable alternative.
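To make the two training objectives concrete, here is a minimal sketch in plain Python. The function names and toy scores are illustrative assumptions, not the paper's implementation: contrastive learning applies a cross-entropy loss over a query's candidate scores against the ground-truth positive passage, while knowledge distillation instead matches the student's score distribution to the teacher's via KL divergence, so no hard labels are needed once a teacher is available.

```python
import math


def softmax(scores):
    """Convert raw reranker scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def contrastive_loss(student_scores, positive_idx):
    """InfoNCE-style loss: cross-entropy against the labeled positive passage."""
    return -math.log(softmax(student_scores)[positive_idx])


def distillation_loss(student_scores, teacher_scores):
    """KL(teacher || student) between the two score distributions."""
    p = softmax(teacher_scores)  # teacher's soft targets
    q = softmax(student_scores)  # student's predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy example: one query scored against three candidate passages
# (hypothetical values; index 0 is the labeled positive).
student = [2.0, 0.5, -1.0]
teacher = [3.0, 1.0, -2.0]
print("contrastive:", contrastive_loss(student, 0))
print("distillation:", distillation_loss(student, teacher))
```

The key practical difference the paper's comparison turns on is visible here: `contrastive_loss` only needs the ground-truth label, while `distillation_loss` needs a teacher whose scores are worth imitating, which is why teacher capacity becomes the decisive factor.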