🤖 AI Summary
This work investigates whether multi-vector retrieval models inherently require large-scale pretraining or can achieve competitive performance solely through knowledge distillation from strong single-vector models. To address this, we introduce ColBERT-Zero, a model pretrained end-to-end exclusively on publicly available data, and conduct a systematic analysis of the interplay among pretraining, supervised fine-tuning, and knowledge distillation. We demonstrate for the first time that models given large-scale multi-vector pretraining significantly outperform distilled-only counterparts, even when the latter leverage stronger but proprietary data, and we highlight the critical importance of aligning pretraining and fine-tuning configurations. At comparable model scales, ColBERT-Zero establishes a new state of the art for multi-vector retrieval trained solely on public data, surpassing both GTE-ModernColBERT and its base encoder, GTE-ModernBERT.
📝 Abstract
Current state-of-the-art multi-vector models are obtained through a small Knowledge Distillation (KD) training step on top of strong single-vector models, leveraging the large-scale pre-training of those models. In this paper, we study the pre-training of multi-vector models and show that large-scale multi-vector pre-training yields much stronger multi-vector models. Notably, a fully ColBERT-pre-trained model, ColBERT-Zero, trained only on public data, outperforms GTE-ModernColBERT as well as its base model, GTE-ModernBERT, which leverages closed and much stronger data, setting a new state of the art for models of this size. We also find that, although performing only a small KD step is not enough to approach the results of full pre-training, adding a supervised step beforehand closes much of the gap while skipping the most costly unsupervised phase. Finally, we find that aligning the fine-tuning and pre-training setups is crucial when repurposing existing models. To enable further exploration of our results, we release various checkpoints as well as the code used to train them.
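For readers unfamiliar with the multi-vector setup discussed above: ColBERT-style models encode each query and document as one embedding per token and score pairs with a late-interaction "MaxSim" operator, rather than comparing single pooled vectors. The following is a minimal NumPy sketch of that scoring rule; the function name and toy shapes are illustrative, not taken from the paper's released code.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token embedding,
    take the maximum cosine similarity over all document token
    embeddings, then sum these maxima over the query tokens."""
    # Normalize token embeddings so dot products are cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (num_query_tokens, num_doc_tokens)
    # MaxSim per query token, summed into a single relevance score.
    return float(sim.max(axis=1).sum())

# Toy example: 2 query tokens, 3 document tokens, 4-dim embeddings.
rng = np.random.default_rng(0)
score = maxsim_score(rng.normal(size=(2, 4)), rng.normal(size=(3, 4)))
```

Because every per-token maximum is a cosine similarity, the score is bounded by the number of query tokens; a single-vector model, by contrast, collapses each side to one embedding before any comparison, which is why distillation from such a teacher transfers a coarser relevance signal.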