Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking

📅 2025-10-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In multimodal retrieval, joint vision-language rerankers incur high computational overhead because visual features are extracted online, which hinders deployment at scale. This paper proposes EDJE, a framework that precomputes visual tokens offline and compresses them with a lightweight attention-based adapter, so online inference runs only a compact joint encoder over the compressed visual tokens and the text. EDJE employs a CLIP-style dual-tower initialization, a compact joint encoder architecture, and a learnable token compression mechanism. It requires 49 kB of storage per image and processes 50,000 image-text pairs per second, while matching prior art on Flickr30K (zero-shot) and COCO (fine-tuned) retrieval. The framework thus offers a practical, efficient option for large-scale multimodal reranking.

📝 Abstract

Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision-language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image-text pairs per second while requiring 49 kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval. The implementation and checkpoints will be made publicly available shortly.
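The abstract does not spell out the adapter's internals, so the following is only a minimal sketch of one common form of attention-based token compression: a small set of learned query vectors cross-attends over the full set of visual tokens, producing a fixed, smaller number of output tokens. The function name `compress_tokens`, the projection matrices, and all sizes (576 input tokens, 32 output tokens, width 64) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_tokens(visual_tokens, queries, w_k, w_v):
    """Cross-attention pooling (assumed design, not the paper's exact adapter).

    visual_tokens: (n_tokens, d) precomputed vision tokens for one image
    queries:       (n_out, d)    learned query vectors, n_out << n_tokens
    w_k, w_v:      (d, d)        key and value projections
    returns:       (n_out, d)    compressed visual tokens
    """
    keys = visual_tokens @ w_k
    values = visual_tokens @ w_v
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ values

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d, n_tokens, n_out = 64, 576, 32
visual_tokens = rng.standard_normal((n_tokens, d))
queries = rng.standard_normal((n_out, d))
w_k = rng.standard_normal((d, d)) / np.sqrt(d)
w_v = rng.standard_normal((d, d)) / np.sqrt(d)

compressed = compress_tokens(visual_tokens, queries, w_k, w_v)
print(compressed.shape)  # (32, 64)
```

In a setup like this, the compression would run once per image offline, and only the compressed tokens would be stored and passed to the joint encoder at query time.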

Problem

Research questions and friction points this paper is trying to address.

Overcoming visual feature extraction bottlenecks in vision-language reranking systems

Enabling scalable multimodal retrieval with efficient joint encoders

Reducing storage and computation costs for high-throughput image-text matching
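To put the reported figures in perspective, the short calculation below uses only the two numbers stated in the abstract (49 kB per image, 50k image-text pairs per second). The corpus size of one million images and the candidate count of 100 per query are hypothetical, and 1 kB is assumed to mean 1,000 bytes. The throughput is treated as a flat rate, which ignores batching and hardware effects.

```python
BYTES_PER_IMAGE = 49_000    # 49 kB per image (from the abstract), assuming 1 kB = 1000 bytes
PAIRS_PER_SECOND = 50_000   # reported online throughput (from the abstract)

def corpus_storage_gb(num_images: int) -> float:
    # Total disk footprint of the precomputed visual tokens, in decimal GB.
    return num_images * BYTES_PER_IMAGE / 1e9

def rerank_seconds(num_candidates: int) -> float:
    # Time to score one query against its candidates at the reported rate.
    return num_candidates / PAIRS_PER_SECOND

print(corpus_storage_gb(1_000_000))  # 49.0 (GB for a hypothetical 1M-image corpus)
print(rerank_seconds(100))           # 0.002 (seconds to rerank a hypothetical top-100)
```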

Innovation

Methods, ideas, or system contributions that make the work stand out.

Precomputes vision tokens offline for efficiency

Compresses tokens via lightweight attention-based adapter

Enables high-throughput inference with compact joint encoder
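The three points above fit a standard retrieve-then-rerank pipeline, sketched below under stated assumptions. A fast embedding search selects the top-k candidates, and a joint scorer then reorders them using each candidate's precomputed, compressed visual tokens. The scorer `toy_joint_score` is a deliberately simple stand-in (mean of per-text-token maximum similarity), not the EDJE joint encoder. All names, sizes, and the random data are hypothetical.

```python
import numpy as np

def first_stage_top_k(query_emb, image_embs, k):
    # Embedding-based retrieval: dot-product similarity, then top-k by score.
    sims = image_embs @ query_emb
    top = np.argpartition(-sims, k - 1)[:k]
    return top[np.argsort(-sims[top])]

def toy_joint_score(text_tokens, image_tokens):
    # Stand-in for a joint encoder: average over text tokens of the
    # best-matching compressed image token. Purely illustrative.
    return float((text_tokens @ image_tokens.T).max(axis=1).mean())

def rerank(candidate_ids, text_tokens, token_store):
    # Score each candidate from its stored tokens; no image encoder runs here.
    scores = np.array([toy_joint_score(text_tokens, token_store[i])
                       for i in candidate_ids])
    order = np.argsort(-scores)
    return candidate_ids[order], scores[order]

# Hypothetical corpus: 1000 images, 32 compressed tokens each, width 64.
rng = np.random.default_rng(0)
n_images, n_tok, d = 1000, 32, 64
image_embs = rng.standard_normal((n_images, d))
token_store = rng.standard_normal((n_images, n_tok, d))  # computed offline
query_emb = rng.standard_normal(d)
text_tokens = rng.standard_normal((8, d))

candidates = first_stage_top_k(query_emb, image_embs, k=5)
reranked_ids, scores = rerank(candidates, text_tokens, token_store)
print(reranked_ids, scores)
```

The design point this illustrates is that the online path touches only stored tokens and the text, so the per-pair cost depends on the size of the joint scorer and the number of compressed tokens rather than on a vision backbone.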

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs

2024-06-26

Citations: 4
