Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking

πŸ“… 2025-10-08
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
In multimodal retrieval, joint vision-language reranking suffers from high computational overhead because visual features must be extracted online, hindering scalable deployment. This paper proposes EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them with a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE builds on a CLIP-style dual-tower initialization and a learnable token-compression mechanism, drastically reducing storage (49 kB per image) and online compute while preserving retrieval accuracy. On Flickr30K and COCO, EDJE matches state-of-the-art reranking performance at a throughput of 50,000 image–text pairs per second, providing a practical, efficient solution for large-scale multimodal reranking.

πŸ“ Abstract
Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision--language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image--text pairs/second while requiring 49kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval. The implementation and checkpoints will be made publicly available shortly.
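The reported 49 kB-per-image footprint is consistent with storing a few dozen compressed visual tokens in half precision. The shapes below are an illustrative back-of-envelope assumption, not the paper's stated configuration:

```python
# Illustrative storage arithmetic (assumed shapes, not taken from the paper):
# storing 32 compressed tokens of dimension 768 in fp16 (2 bytes per value).
n_tokens, dim, bytes_per_value = 32, 768, 2
total_bytes = n_tokens * dim * bytes_per_value
print(total_bytes)  # 49152 bytes, i.e. roughly 49 kB per image
```

Under this assumption, the per-image cost is dominated entirely by the compressed token cache; the original patch tokens (e.g. ~200 per image for a ViT-B/16) never need to be stored.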
Problem

Research questions and friction points this paper is trying to address.

Overcoming visual feature extraction bottlenecks in vision-language reranking systems
Enabling scalable multimodal retrieval with efficient joint encoders
Reducing storage and computation costs for high-throughput image-text matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Precomputes vision tokens offline for efficiency
Compresses tokens via lightweight attention-based adapter
Enables high-throughput inference with compact joint encoder
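The compression step above can be sketched as a small cross-attention module: a fixed set of learned query vectors attends over the precomputed vision tokens and pools them into a compact representation that is written to disk offline. This is a minimal, hypothetical sketch assuming a Perceiver-style resampler; the class name, sizes, and layer choices are illustrative assumptions, not the paper's released code.

```python
# Hypothetical sketch of an attention-based token-compression adapter:
# learned queries cross-attend to precomputed ViT patch tokens, yielding
# a small set of visual tokens to cache offline. Sizes are assumptions.
import torch
import torch.nn as nn

class TokenCompressionAdapter(nn.Module):
    def __init__(self, dim=768, n_queries=32, n_heads=8):
        super().__init__()
        # Learned queries: each pools information from all vision tokens.
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vision_tokens):  # vision_tokens: (B, N_patches, dim)
        b = vision_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)  # (B, n_queries, dim)
        compressed, _ = self.attn(q, vision_tokens, vision_tokens)
        return self.norm(compressed)  # (B, n_queries, dim)

adapter = TokenCompressionAdapter()
patch_tokens = torch.randn(2, 197, 768)  # e.g. ViT-B/16 output incl. CLS
compact = adapter(patch_tokens)
print(compact.shape)  # torch.Size([2, 32, 768])
```

At query time, only the cached `(n_queries, dim)` tensor and the text would pass through the compact joint encoder, which is what enables the high throughput the paper reports.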
πŸ”Ž Similar Papers
M
Mitchell Keren Taraday
INSIGHT Lab, Ben-Gurion University of the Negev, Israel
S
Shahaf Wagner
INSIGHT Lab, Ben-Gurion University of the Negev, Israel
Chaim Baskin
Chaim Baskin
Assistant Professor (Senior Lecturer) at Ben-Gurion University of the Negev
Deep learningMachine learningComputer VisionGraph Neural NetworksGeometric Deep Learning