🤖 AI Summary
To address the limited robustness of content-based image retrieval (CBIR) models and their weaker performance on cross-domain and fine-grained retrieval, this paper proposes the Evidential Transformer, an uncertainty-driven transformer model. It brings evidential deep learning, previously unexplored in deep metric learning, into the training pipeline, using evidential classification with interpretable uncertainty quantification as a principled alternative to conventional multi-class classification. The method uses the Global Context Vision Transformer (GC ViT) to capture global contextual dependencies and produce discriminative feature representations. Evaluated on Stanford Online Products (SOP) and CUB-200-2011, the approach reports new state-of-the-art results across all standard retrieval protocols, including Recall@K, NMI, and F1-score, with improvements in retrieval reliability, cross-domain generalization, and fine-grained discrimination.
📝 Abstract
We introduce the Evidential Transformer, an uncertainty-driven transformer model for improved and robust image retrieval. In this paper, we make several contributions to content-based image retrieval (CBIR). We incorporate probabilistic methods into image retrieval, achieving robust and reliable results, with evidential classification surpassing traditional training based on multiclass classification as a baseline for deep metric learning. Furthermore, we improve the state-of-the-art retrieval results on several datasets by leveraging the Global Context Vision Transformer (GC ViT) architecture. Our experimental results consistently demonstrate the reliability of our approach, setting a new benchmark in CBIR in all test settings on the Stanford Online Products (SOP) and CUB-200-2011 datasets.
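To make the contrast with softmax-based multiclass classification concrete, the following is a minimal sketch of how an evidential classification head can turn raw logits into class beliefs plus an explicit uncertainty mass. It follows the common subjective-logic formulation of evidential deep learning (non-negative evidence, Dirichlet parameters `alpha = evidence + 1`, uncertainty `K / sum(alpha)`). The abstract does not specify the exact formulation or loss used by the Evidential Transformer, so the function name `evidential_output` and the softplus evidence mapping are assumptions made for illustration only.

```python
import math


def softplus(z):
    # Numerically stable softplus: log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def evidential_output(logits):
    """Map class logits to belief masses, an uncertainty mass, and expected probabilities.

    Hypothetical helper: the evidence mapping (softplus) is an assumed choice,
    not taken from the paper.
    """
    num_classes = len(logits)
    evidence = [softplus(z) for z in logits]      # non-negative evidence per class
    alpha = [e + 1.0 for e in evidence]           # Dirichlet concentration parameters
    strength = sum(alpha)                         # total Dirichlet strength S
    belief = [e / strength for e in evidence]     # belief mass per class
    uncertainty = num_classes / strength          # mass left unassigned to any class
    expected_prob = [a / strength for a in alpha] # mean of the Dirichlet
    return belief, uncertainty, expected_prob


# One class with strong evidence versus no class with meaningful evidence.
confident = evidential_output([10.0, -5.0, -5.0])
unsure = evidential_output([-5.0, -5.0, -5.0])
print("uncertainty (confident):", round(confident[1], 3))
print("uncertainty (unsure):", round(unsure[1], 3))
```

Unlike a softmax output, the belief masses and the uncertainty mass sum to one, so an input that yields little evidence for every class is assigned a high uncertainty rather than a forced class distribution. In a retrieval setting this quantity could, under the same assumptions, serve as a signal for how much to trust an embedding.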