Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-based document retrieval methods face two key challenges: (1) vision-centric models are constrained by a persistent modality gap and rely on increasingly large dense multimodal representations, hindering deployment and scalability; (2) existing hybrid retrieval approaches perform only coarse-grained fusion of ranks or scores, failing to exploit fine-grained interactions within each model's representation space. To address these, the paper proposes Guided Query Refinement (GQR), a test-time optimization method that refines a vision-centric retriever's query embedding using guidance from a lightweight text retriever's scores. On visual document retrieval benchmarks, GQR allows vision-centric models to match the performance of models with significantly larger representations while being up to 14× faster and requiring 54× less memory, pushing the Pareto frontier between effectiveness and efficiency.

📝 Abstract
Multimodal encoders have pushed the boundaries of visual document retrieval, matching textual query tokens directly to image patches and achieving state-of-the-art performance on public benchmarks. Recent models relying on this paradigm have massively scaled the sizes of their query and document representations, presenting obstacles to deployment and scalability in real-world pipelines. Furthermore, purely vision-centric approaches may be constrained by the inherent modality gap still exhibited by modern vision-language models. In this work, we connect these challenges to the paradigm of hybrid retrieval, investigating whether a lightweight dense text retriever can enhance a stronger vision-centric model. Existing hybrid methods, which rely on coarse-grained fusion of ranks or scores, fail to exploit the rich interactions within each model's representation space. To address this, we introduce Guided Query Refinement (GQR), a novel test-time optimization method that refines a primary retriever's query embedding using guidance from a complementary retriever's scores. Through extensive experiments on visual document retrieval benchmarks, we demonstrate that GQR allows vision-centric models to match the performance of models with significantly larger representations, while being up to 14x faster and requiring 54x less memory. Our findings show that GQR effectively pushes the Pareto frontier for performance and efficiency in multimodal retrieval. We release our code at https://github.com/IBM/test-time-hybrid-retrieval
Problem

Research questions and friction points this paper is trying to address.

Enhancing vision-centric retrieval with lightweight text retriever guidance
Optimizing query embeddings at test time for multimodal hybrid retrieval
Improving performance and efficiency in visual document retrieval systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-time optimization refines query embeddings using a complementary retriever's scores as guidance
Lightweight dense text retriever enhances a stronger vision-centric model
Representation-level hybrid retrieval improves efficiency while maintaining accuracy
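The core idea, refining the primary retriever's query embedding at test time so its scores better agree with a guide retriever, can be sketched as a small gradient-descent loop. This is an illustrative sketch only, not the paper's exact objective: it assumes a cross-entropy loss between the guide's softmax score distribution and the primary retriever's, and the function name `gqr_refine` and all hyperparameters are hypothetical.

```python
import numpy as np

def softmax(x, temp=1.0):
    # Numerically stable softmax over a 1-D score vector.
    z = x / temp - np.max(x / temp)
    e = np.exp(z)
    return e / e.sum()

def gqr_refine(q, docs, guide_scores, steps=50, lr=0.1, temp=1.0):
    """Hypothetical sketch of guided query refinement.

    q            -- primary retriever's query embedding, shape (d,)
    docs         -- document embeddings, shape (n, d)
    guide_scores -- complementary (text) retriever's scores, shape (n,)
    """
    # Target distribution derived from the guide retriever's scores.
    p = softmax(guide_scores, temp)
    q = q.astype(float).copy()
    for _ in range(steps):
        s = softmax(docs @ q, temp)      # primary retriever's score distribution
        grad = docs.T @ (s - p) / temp   # gradient of cross-entropy H(p, s) w.r.t. q
        q -= lr * grad                   # nudge the query toward the guide's ranking
    return q
```

Because the loss (log-sum-exp minus a linear term in `q`) is convex, the refined query's score distribution provably moves closer to the guide's; at inference the refined embedding is then scored against the document index as usual.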