Visual Words Meet BM25: Sparse Auto-Encoder Visual Word Scoring for Image Retrieval

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost and limited interpretability of dense image retrieval, which hinder its efficient deployment at scale. The authors propose an efficient image retrieval method that combines sparse autoencoders (SAEs) with BM25 scoring: visual word activations are extracted from patch-level features of a Vision Transformer, and, for the first time, the BM25 mechanism is adapted to the visual domain. By applying inverse document frequency (IDF) weighting, the approach suppresses frequent, low-information visual words while emphasizing rare, discriminative ones. Combined with an inverted index and a two-stage retrieval architecture, the method achieves zero-shot transfer without fine-tuning, attaining Recall@200 ≥ 0.993 across seven benchmarks. Remarkably, reranking only the top-200 candidates recovers near-dense retrieval accuracy, with an average gap of less than 0.2%, offering both high efficiency and strong interpretability.
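The scoring described above is standard Okapi BM25 applied to sparse visual-word activations instead of text terms. The paper's exact formulation and parameter choices are not given in this summary, so the sketch below is a minimal illustration of the idea: each image is a "document" of visual words, and rare words (low document frequency) receive higher IDF weight. All names and defaults (`k1`, `b`) are the conventional BM25 ones, assumed rather than taken from the paper.

```python
import math

def bm25_idf(df, n_docs):
    # Okapi BM25 IDF: rare visual words get large weights,
    # ubiquitous ones are suppressed toward zero.
    return math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)

def bm25_score(query_words, doc_words, df, n_docs, avg_len, k1=1.2, b=0.75):
    """Score one gallery image against a query.

    query_words / doc_words: dict {visual_word_id: activation weight}
    df: dict {visual_word_id: document frequency across the gallery}
    avg_len: mean total activation ("document length") over the gallery
    """
    doc_len = sum(doc_words.values())
    score = 0.0
    for w in query_words:
        tf = doc_words.get(w, 0.0)  # activation acts as term frequency
        if tf == 0.0:
            continue
        # Saturating, length-normalized term-frequency component.
        norm = tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * doc_len / avg_len))
        score += bm25_idf(df.get(w, 0), n_docs) * norm
    return score
```

Because of the IDF term, an image matching a rare visual word outscores one matching only a word that appears in most of the gallery, which is the behavior the Zipfian-like frequency distribution motivates.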

📝 Abstract
Dense image retrieval is accurate but offers limited interpretability and attribution, and it can be compute-intensive at scale. We present BM25-V, which applies Okapi BM25 scoring to sparse visual-word activations from a Sparse Auto-Encoder (SAE) on Vision Transformer patch features. Across a large gallery, visual-word document frequencies are highly imbalanced and follow a Zipfian-like distribution, making BM25's inverse document frequency (IDF) weighting well suited for suppressing ubiquitous, low-information words and emphasizing rare, discriminative ones. BM25-V retrieves high-recall candidates via sparse inverted-index operations and serves as an efficient first-stage retriever for dense reranking. Across seven benchmarks, BM25-V achieves Recall@200 ≥ 0.993, enabling a two-stage pipeline that reranks only K = 200 candidates per query and recovers near-dense accuracy within 0.2% on average. An SAE trained once on ImageNet-1K transfers zero-shot to seven fine-grained benchmarks without fine-tuning, and BM25-V retrieval decisions are attributable to specific visual words with quantified IDF contributions.
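The first stage of the pipeline relies on an inverted index: each visual word maps to the list of gallery images that activate it, so only images sharing at least one active word with the query are ever touched. The sketch below uses a plain sparse dot product as a simplified stand-in for the paper's BM25 scoring; the data layout and function names are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def build_inverted_index(gallery):
    """gallery: dict {image_id: {visual_word_id: activation weight}}.

    Returns {visual_word_id: [(image_id, weight), ...]} posting lists.
    """
    index = defaultdict(list)
    for img_id, words in gallery.items():
        for w, weight in words.items():
            index[w].append((img_id, weight))
    return index

def first_stage_candidates(query_words, index, k=200):
    # Accumulate scores only over posting lists of the query's active
    # words (sparse dot product), then keep the top-k candidates for
    # dense reranking in the second stage.
    scores = defaultdict(float)
    for w, q_weight in query_words.items():
        for img_id, weight in index.get(w, []):
            scores[img_id] += q_weight * weight
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

With Recall@200 ≥ 0.993, reranking only these k = 200 candidates with dense features recovers near-dense accuracy while the expensive dense comparison is confined to a tiny candidate set.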
Problem

Research questions and friction points this paper is trying to address.

image retrieval
interpretability
sparse representation
computational efficiency
attribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

BM25
Sparse Auto-Encoder
Visual Words
Image Retrieval
Interpretable Retrieval