🤖 AI Summary
To address the inefficiency of dual-encoder retrieval models—namely, their reliance on teacher models, complex batch sampling strategies, and low training throughput—this paper proposes a parameter-free self-distillation loss. Our method eliminates the need for external teachers or explicit hard negative sampling by leveraging the intrinsic semantic capabilities of pretrained language models to perform implicit hard negative mining and self-supervised optimization. Furthermore, we introduce an adaptive relevance margin to enhance representation discriminability. Empirically, our approach achieves performance on par with teacher-based distillation baselines using only 13.5% of the training data, while accelerating training by 3–15×. All code and datasets are publicly released.
📝 Abstract
Representation-based retrieval models, so-called biencoders, estimate the relevance of a document to a query by computing the similarity of their respective embeddings. Current state-of-the-art biencoders are trained using an expensive regime involving knowledge distillation from a teacher model and batch sampling. Instead of relying on a teacher model, we contribute a novel parameter-free loss function for self-supervision that exploits the pre-trained language modeling capabilities of the encoder model as a training signal, eliminating the need for batch sampling by performing implicit hard negative mining. We investigate the capabilities of our proposed approach through extensive ablation studies, demonstrating that self-distillation can match the effectiveness of teacher distillation using only 13.5% of the data, while offering a 3x to 15x speedup in training time compared to parametrized losses. Code and data are made openly available.
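The biencoder scoring scheme described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the embedding vectors below are hypothetical placeholders standing in for the outputs of a pretrained encoder, and cosine similarity is used as one common choice of similarity function.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings; in a real biencoder each vector would be
# produced independently by the encoder, e.g. encoder(query), encoder(doc).
query_emb = np.array([0.2, 0.9, 0.1])
doc_emb_a = np.array([0.25, 0.85, 0.05])   # semantically close to the query
doc_emb_b = np.array([0.9, -0.3, 0.4])     # semantically distant

score_a = cosine_similarity(query_emb, doc_emb_a)
score_b = cosine_similarity(query_emb, doc_emb_b)
assert score_a > score_b  # the more relevant document receives the higher score
```

Because documents are encoded independently of the query, their embeddings can be precomputed and indexed, which is what makes this retrieval architecture efficient at query time.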