🤖 AI Summary
This study investigates the generalization limits of multi-vector retrieval models, specifically ColBERT-v2 and ConstBERT, in non-standard scenarios, particularly long narrative queries. The authors evaluate the models through ablation studies, cross-backend deployment, diverse query distributions, and large-scale fine-tuning. On MS MARCO, ConstBERT reproduces to within 0.05% MRR@10. On the TREC ToT 2025 long-query benchmark, however, both models drop by 86–97%, and fine-tuning with 3x more data degrades performance by up to 29%. The work identifies the MaxSim operator's uniform token weighting as a key culprit: it cannot distinguish signal from filler noise. This exposes an architectural limitation of multi-vector models that fine-tuning alone cannot overcome.
📝 Abstract
Reproducibility must validate architectural robustness, not just numerical accuracy. We evaluate ColBERT-v2 and ConstBERT across five dimensions, finding that while ConstBERT reproduces within 0.05% MRR@10 on MS MARCO, both models show a drop of 86–97% on long, narrative queries (TREC ToT 2025). Ablations prove this failure is architectural: performance plateaus at 20 words because the MaxSim operator's uniform token weighting cannot distinguish signal from filler noise. Furthermore, undocumented backend parameters create an 8-point gap due to ConstBERT's sparse centroid coverage, and fine-tuning with 3x more data actually degrades performance by up to 29%. We conclude that architectural constraints in multi-vector retrieval cannot be overcome by adaptation alone. Code: https://github.com/utshabkg/multi-vector-reproducibility.
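
To make the MaxSim argument concrete, here is a minimal sketch of late-interaction scoring, assuming the standard formulation in which each query token takes its maximum cosine similarity over the document tokens and the maxima are summed with equal weight. The function names and the 3-dimensional toy vectors are hypothetical and chosen only for illustration; they are not taken from the paper or its repository, and real models use learned embeddings of much higher dimension.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best cosine match among document tokens.

    Every query token contributes with the same weight, whether it carries
    signal or is filler.
    """
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    total = 0.0
    for qv in q:
        total += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return total


# Hypothetical toy embeddings (assumption: one axis per "meaning").
signal = [1.0, 0.0, 0.0]   # the one informative query token
filler = [0.0, 1.0, 0.0]   # narrative filler such as "I remember that"

relevant_doc = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]    # matches the signal token
distractor_doc = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # matches only the filler

short_query = [signal]
long_query = [signal] + [filler] * 5  # same signal, buried in filler

short_relevant = maxsim_score(short_query, relevant_doc)
short_distractor = maxsim_score(short_query, distractor_doc)
long_relevant = maxsim_score(long_query, relevant_doc)
long_distractor = maxsim_score(long_query, distractor_doc)

print("short query:", short_relevant, "vs", short_distractor)
print("long query: ", long_relevant, "vs", long_distractor)
```

In this toy setup the short query ranks the relevant document first, while the long query ranks the distractor first, because five filler tokens each add as much to the sum as the single signal token. This is only an illustration of the uniform-weighting mechanism the abstract describes, not a reproduction of the reported numbers.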