AI Summary
In multimodal retrieval, vision-language joint re-ranking suffers from high computational overhead due to online visual feature extraction, which hinders scalable deployment. This paper proposes EDJE, a framework that combines offline precomputation of visual tokens with a lightweight attention adapter, so that online inference operates on compressed visual representations. EDJE employs a CLIP-style dual-tower initialization, a compact joint encoder architecture, and a learnable token compression mechanism. It reduces storage to 49 kB per image and cuts online computation while preserving retrieval accuracy. On Flickr30K (zero-shot) and COCO (fine-tuned), EDJE matches prior re-ranking results at an inference throughput of 50,000 image–text pairs per second. The framework thus provides a practical, efficient solution for large-scale multimodal re-ranking.
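To put the two reported figures in context, the following back-of-the-envelope sketch converts them into index size and re-ranking time. The corpus size and the candidate count `top_k` are hypothetical values chosen for illustration, 1 kB is assumed to mean 1000 bytes, and the latency estimate assumes the reported batched throughput also holds for a single query's candidates, which may not be true in practice.

```python
# Figures reported in the summary.
BYTES_PER_IMAGE = 49_000      # 49 kB per image (assuming 1 kB = 1000 bytes)
PAIRS_PER_SECOND = 50_000     # image-text pairs scored per second


def index_size_gb(num_images: int) -> float:
    """Disk needed to store precomputed visual tokens, in gigabytes."""
    return num_images * BYTES_PER_IMAGE / 1e9


def rerank_latency_ms(top_k: int) -> float:
    """Time to re-rank top_k candidates for one query, in milliseconds."""
    return top_k * 1e3 / PAIRS_PER_SECOND


corpus_size = 1_000_000   # hypothetical corpus
top_k = 100               # hypothetical number of candidates from vector search

print(index_size_gb(corpus_size))   # 49.0
print(rerank_latency_ms(top_k))     # 2.0
```

Under these assumptions, a one-million-image corpus needs about 49 GB of token storage, and re-ranking 100 candidates takes roughly 2 ms per query.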
Abstract
Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision–language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image–text pairs/second while requiring 49 kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval. The implementation and checkpoints will be made publicly available shortly.
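The offline compression step can be illustrated with a minimal sketch of attention-based token pooling, in which a small set of learned queries attends over the full set of vision tokens. This is not the authors' implementation: the function `compress_tokens`, the single-head formulation, the random weights, and all sizes (576 input tokens, 64 output tokens, width 384) are assumptions made for illustration. The sizes were picked so that the compressed tokens occupy 49,152 bytes in float16, which is in the same range as the reported 49 kB; the abstract does not state the actual configuration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compress_tokens(vision_tokens, queries, w_k, w_v):
    """Single-head cross-attention pooling (hypothetical adapter sketch).

    vision_tokens: (n, d_in) tokens from a vision encoder, computed offline
    queries:       (m, d) learned query vectors, with m much smaller than n
    w_k, w_v:      (d_in, d) key and value projection matrices
    returns:       (m, d) compressed visual tokens to store on disk
    """
    keys = vision_tokens @ w_k
    values = vision_tokens @ w_v
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)   # each query's weights sum to 1
    return weights @ values


# Hypothetical sizes; random weights stand in for trained parameters.
rng = np.random.default_rng(0)
n_tokens, d_in, d_model, m = 576, 1024, 384, 64
vision_tokens = rng.standard_normal((n_tokens, d_in))
queries = rng.standard_normal((m, d_model))
w_k = rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)
w_v = rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)

compressed = compress_tokens(vision_tokens, queries, w_k, w_v)
print(compressed.shape)                        # (64, 384)
print(compressed.astype(np.float16).nbytes)    # 49152
```

At query time, only the stored `compressed` array would be loaded and passed to the joint encoder together with the text tokens, so the vision encoder never runs online.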