🤖 AI Summary
This work addresses the limitations of traditional vector retrieval, which is constrained by fixed top-k results and struggles to model contextual query-document interactions and multimodal relevance distributions. The authors propose a novel framework that integrates large language model (LLM)-based query rewriting with multimodal Bayesian optimization. Starting from an LLM-rewritten query to initialize the posterior, the method iteratively samples batches of documents, obtains relevance scores from the LLM, and uses them to update the posterior distribution, thereby transcending the top-k bottleneck for more comprehensive relevance modeling. This approach uniquely combines Bayesian optimization, LLM-driven query reformulation, and batch-wise relevance observation. Evaluated on five BEIR datasets, it achieves substantial gains, including 46.5% recall@100 (vs. 35.0% for the best LLM reranker baseline) and 63.6% NDCG@10 on Robust04, with latency comparable to existing LLM-based rerankers.
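The iterative loop described above can be sketched in miniature. This is a hypothetical simplification, not the paper's implementation: it keeps a per-document Beta posterior over relevance (rather than the paper's multimodal posterior), initializes it from seed scores standing in for the LLM-rewritten query, acquires batches by Thompson sampling, and updates the posterior with LLM relevance scores in [0, 1]. The names `rebol_sketch`, `llm_score`, and `seed_scores` are all illustrative assumptions.

```python
import random

def rebol_sketch(docs, llm_score, seed_scores, rounds=5, batch=3):
    """Hypothetical sketch of a ReBOL-style loop (not the authors' code).

    Each document carries a Beta(alpha, beta) posterior over its relevance
    probability. seed_scores (standing in for LLM query-reformulation
    similarity) initialize the posterior; each round, a batch is acquired
    by Thompson sampling and the posterior is updated with LLM scores.
    """
    # Posterior state: [alpha, beta] pseudo-counts per document.
    post = {d: [1.0 + seed_scores.get(d, 0.0), 1.0] for d in docs}
    for _ in range(rounds):
        # Acquisition: draw one sample per posterior, take the top-batch draws.
        draws = {d: random.betavariate(a, b) for d, (a, b) in post.items()}
        picked = sorted(draws, key=draws.get, reverse=True)[:batch]
        for d in picked:
            s = llm_score(d)        # LLM relevance score in [0, 1]
            post[d][0] += s         # pseudo-count toward "relevant"
            post[d][1] += 1.0 - s   # pseudo-count toward "not relevant"
    # Final ranking by posterior mean relevance.
    return sorted(docs, key=lambda d: post[d][0] / sum(post[d]), reverse=True)
```

Because acquisition is batch-wise over the whole corpus rather than a fixed top-k list, documents missed by the initial retrieval can still be scored and promoted in later rounds, which mirrors how the paper frames its recall gains.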
📝 Abstract
LLM reranking is limited by the top-k documents retrieved by vector similarity, which neither enables contextual query-document token interactions nor captures multimodal relevance distributions. While LLM query reformulation attempts to improve recall by generating improved or additional queries, it is still followed by vector similarity retrieval. We thus propose to address these failures of the top-k retrieval stage by introducing ReBOL, which 1) uses LLM query reformulations to initialize a multimodal Bayesian Optimization (BO) posterior over document relevance, and 2) iteratively acquires document batches for LLM query-document relevance scoring, followed by posterior updates to optimize relevance. After exploring query reformulation and document batch diversification techniques, we evaluate ReBOL against LLM reranker baselines on five BEIR datasets using two LLMs (Gemini-2.5-Flash-Lite, GPT-5.2). ReBOL consistently achieves higher recall and competitive ranking quality; on the Robust04 dataset, for example, it attains 46.5% recall@100 vs. 35.0% and 63.6% NDCG@10 vs. 61.2% for the best LLM reranker. We also show that ReBOL can achieve latency comparable to LLM rerankers.