🤖 AI Summary
This work systematically investigates how key training factors (embedding architecture, loss function, sampling strategy, hard example mining, learning rate schedule, and batch size) affect retrieval accuracy in image retrieval models. Through tens of thousands of controlled training runs across multiple benchmark datasets (GLDv2, ROxford, RParis), we construct the first comprehensive influence map of training components for image retrieval. Our analysis reveals a consistently effective configuration: learning rate warmup followed by cosine annealing, intra-class uniform sampling, and progressive hard example mining. This combination improves mean Average Precision (mAP) by up to 8.2% across standard benchmarks, substantially outperforming common heuristic practices. We further propose a transferable, end-to-end training guideline grounded in empirical evidence and open-source a scalable distributed training framework with a full reference implementation. The framework and guidelines have been widely adopted in both academia and industry, enabling reproducible, high-performance training of image retrieval models.
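The learning rate schedule in the recommended configuration (linear warmup, then cosine annealing) can be sketched as a plain function of the training step. This is a minimal illustrative sketch, not the paper's implementation; the function name, signature, and default values are assumptions.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Hypothetical helper: linear warmup to base_lr, then cosine annealing
    down to min_lr over the remaining steps."""
    if step < warmup_steps:
        # Linear warmup: ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine annealing: progress goes from 0 (end of warmup) to 1 (last step).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In a real training loop this would typically be wired into the optimizer via something like PyTorch's `LambdaLR` rather than called by hand.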
📝 Abstract
Image retrieval is the task of finding images in a database that are most similar to a given query image. The performance of an image retrieval pipeline depends on many training-time factors, including the embedding model architecture, loss function, data sampler, mining function, learning rate(s), and batch size. In this work, we run tens of thousands of training runs to understand the effect each of these factors has on retrieval accuracy. We also discover best practices that hold across multiple datasets. The code is available at https://github.com/gmberton/image-retrieval
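A data sampler of the kind studied here is often realized as an m-per-class batch sampler: each batch draws a fixed number of classes and, uniformly within each class, a fixed number of examples. The sketch below is a hypothetical pure-Python illustration under that assumption; the function name and signature are not from the released code.

```python
import random
from collections import defaultdict

def m_per_class_batches(labels, classes_per_batch, samples_per_class, seed=0):
    """Hypothetical sampler sketch: yield batches of dataset indices that
    contain `classes_per_batch` classes with `samples_per_class` examples
    each, sampled uniformly within every class."""
    rng = random.Random(seed)
    # Group dataset indices by class label.
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    # Keep only classes large enough to fill their quota.
    classes = [c for c, idxs in by_class.items() if len(idxs) >= samples_per_class]
    rng.shuffle(classes)
    # Emit one batch per group of `classes_per_batch` classes.
    for i in range(0, len(classes) - classes_per_batch + 1, classes_per_batch):
        batch = []
        for c in classes[i:i + classes_per_batch]:
            batch.extend(rng.sample(by_class[c], samples_per_class))
        yield batch
```

This batch structure guarantees that every batch contains positive pairs, which pair- and triplet-based losses and hard example mining both rely on.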