🤖 AI Summary
In-context ranking (ICR) suffers from severe computational inefficiency on long candidate document lists due to the quadratic complexity of standard LLM attention mechanisms. To address this, we propose BlockRank—a novel, efficient ICR framework that leverages an empirically discovered block-sparse attention pattern inherent in cross-document ranking tasks. BlockRank introduces an architecture-level block-sparse attention mechanism and a query-document contrastive learning objective to improve relevance modeling. Implemented atop Mistral-7B, BlockRank reduces attention computation complexity from quadratic to linear in the number of candidates. It matches or exceeds state-of-the-art performance on BEIR, MSMARCO, and Natural Questions benchmarks. BlockRank achieves a 4.7× speedup for ranking 100 documents and enables near-real-time ranking of up to 500 documents—approximately 100K tokens of context—thereby substantially alleviating the practical deployment bottleneck of ICR.
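The block-sparse attention pattern described here can be pictured as a structured attention mask in which each candidate document attends only to a shared prefix and to its own tokens, while the query attends to everything. The sketch below is a minimal illustration under assumed segment layout and function names; it is not the paper's implementation:

```python
# Hypothetical sketch of an inter-document block-sparse attention mask
# (segment layout and names are illustrative assumptions, not BlockRank's
# actual code).
import numpy as np

def blockrank_mask(instr_len, doc_lens, query_len):
    """Build a boolean attention mask: each document block is dense within
    itself and sees the shared instruction prefix, but not other documents;
    query tokens attend to the full context. The number of unmasked entries
    grows linearly in the number of candidate documents."""
    total = instr_len + sum(doc_lens) + query_len
    mask = np.zeros((total, total), dtype=bool)
    # Instruction prefix: dense within itself.
    mask[:instr_len, :instr_len] = True
    offset = instr_len
    for length in doc_lens:
        # Each document block sees the instruction and its own tokens only.
        mask[offset:offset + length, :instr_len] = True
        mask[offset:offset + length, offset:offset + length] = True
        offset += length
    # Query tokens attend to instruction, all documents, and themselves.
    mask[offset:, :] = True
    return mask

m = blockrank_mask(instr_len=2, doc_lens=[3, 3], query_len=2)
# Cross-document attention is masked: document 1 cannot see document 2.
assert not m[2:5, 5:8].any() and not m[5:8, 2:5].any()
# Query rows remain fully dense.
assert m[8:, :].all()
```

Because cross-document entries are never materialized, total attention cost is the sum of per-block costs plus the query rows, hence linear rather than quadratic in the number of candidates.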
📝 Abstract
In-context Ranking (ICR) is an emerging paradigm for Information Retrieval (IR) that leverages the contextual understanding of LLMs by directly incorporating the task description, candidate documents, and the query into the model's input prompt, and tasking the LLM with identifying the relevant document(s). While effective, efficiency is a significant challenge in this paradigm, especially as the candidate list grows, due to the quadratic scaling of the attention operation with context length. To address this, this paper first identifies inherent and exploitable structures in the attention of LLMs fine-tuned for ICR: (1) inter-document block sparsity: attention is dense within each document block but sparse across different documents in the context; and (2) query-document block relevance: the attention scores from certain query tokens to a document block in the middle layers strongly correlate with that document's actual relevance. Motivated by these observations, we introduce BlockRank (Blockwise In-context Ranking), a novel method that adapts the attention operation in an LLM by (a) architecturally enforcing the observed inter-document block sparsity, reducing attention complexity from quadratic to linear without loss in performance, and (b) optimizing query-document block relevance for the true relevant documents during fine-tuning with an auxiliary contrastive objective, strengthening the retrieval signal carried in attention. Experiments on BEIR, MSMARCO, and NQ with Mistral-7B demonstrate that BlockRank matches or outperforms existing SOTA listwise rankers and a controlled fine-tuned baseline while being significantly more efficient at inference (4.7× faster for 100 MSMARCO documents in context) and scaling gracefully to long-context shortlists of around 500 in-context documents (approximately 100K context length) within a second, presenting a scalable and effective solution for ICR.
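The auxiliary contrastive objective can be sketched as a softmax cross-entropy over per-document attention mass from the query tokens, pushing probability toward the labeled relevant document. The aggregation and scoring below are illustrative assumptions, not the paper's exact recipe:

```python
# Hypothetical sketch of the query-document contrastive objective: treat
# each document's mid-layer attention mass (aggregated from query tokens)
# as a logit, then apply an InfoNCE-style cross-entropy against the
# labeled relevant document. Layer choice and aggregation are assumptions.
import math

def contrastive_loss(doc_attention_mass, positive_idx):
    """-log softmax(scores)[positive]: low when attention from the query
    concentrates on the relevant document, high when it is diffuse."""
    m = max(doc_attention_mass)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in doc_attention_mass]
    return -math.log(exps[positive_idx] / sum(exps))

# The loss falls as attention mass concentrates on the relevant document.
diffuse = contrastive_loss([1.0, 1.0, 1.0], positive_idx=0)
focused = contrastive_loss([5.0, 1.0, 1.0], positive_idx=0)
assert focused < diffuse
```

Training with such an objective is what makes the middle-layer attention scores themselves usable as a relevance signal at inference time.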