Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers

📅 2024-10-03
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work challenges the necessity of autoregressive generation in LLM-based zero-shot re-ranking, arguing that such reliance overemphasizes generative capability and hinders the adoption of open-source models. To address this, the authors propose ICR, a generation-free re-ranker that is the first to use the shift in attention patterns triggered by the search query as the primary re-ranking signal, augmented with calibration via a content-free query to mitigate intrinsic model bias. ICR requires only a constant number of forward passes, involves no fine-tuning, and is plug-and-play compatible with arbitrary open-weight LLMs (e.g., Llama-3, Mistral) while guaranteeing a well-formed ranking. On both single-hop and multi-hop retrieval benchmarks, ICR outperforms RankGPT, cutting latency by more than 60% in practice and delivering especially large gains on tasks that require complex re-ranking signals.

📝 Abstract
Information retrieval (IR) systems have played a vital role in modern digital life and have cemented their continued usefulness in this new era of generative AI via retrieval-augmented generation. With strong language processing capabilities and remarkable versatility, large language models (LLMs) have become popular choices for zero-shot re-ranking in IR systems. So far, LLM-based re-ranking methods rely on strong generative capabilities, which restricts their use to either specialized or powerful proprietary models. Given these restrictions, we ask: is autoregressive generation necessary and optimal for LLMs to perform re-ranking? We hypothesize that there are abundant signals relevant to re-ranking within LLMs that might not be used to their full potential via generation. To more directly leverage such signals, we propose in-context re-ranking (ICR), a novel method that leverages the change in attention pattern caused by the search query for accurate and efficient re-ranking. To mitigate the intrinsic biases in LLMs, we propose a calibration method using a content-free query. Due to the absence of generation, ICR only requires two ($O(1)$) forward passes to re-rank $N$ documents, making it substantially more efficient than generative re-ranking methods that require at least $O(N)$ forward passes. Our novel design also enables ICR to be applied to any LLM without specialized training while guaranteeing a well-formed ranking. Extensive experiments with two popular open-weight LLMs on standard single-hop and multi-hop information retrieval benchmarks show that ICR outperforms RankGPT while cutting the latency by more than 60% in practice. Through detailed analyses, we show that ICR's performance is specially strong on tasks that require more complex re-ranking signals. Our findings call for further exploration on novel ways of utilizing open-weight LLMs beyond text generation.
Problem

Research questions and friction points this paper is trying to address.

Efficient zero-shot re-ranking in IR systems using LLMs
Leveraging attention patterns for accurate document re-ranking
Reducing latency and computational cost in LLM-based re-ranking
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-context re-ranking leverages attention pattern changes.
Calibration method reduces LLM biases using content-free queries.
ICR requires only two forward passes for efficient re-ranking.
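The scoring idea described above can be sketched in a few lines. The snippet below is a minimal, illustrative toy, not the paper's exact implementation: it assumes attention has already been aggregated (e.g., over layers and heads) into a single token-by-token matrix, that document and query token spans are known, and that calibration simply subtracts the scores obtained with a content-free query (e.g., "N/A") from the scores obtained with the real query.

```python
import numpy as np

def icr_scores(attn, doc_spans, query_span):
    """Score each document by the attention mass flowing from the query
    tokens (rows) to that document's tokens (columns).

    attn       : (T, T) aggregated attention matrix (assumption: already
                 pooled over layers/heads; the paper's aggregation may differ)
    doc_spans  : list of (start, end) token index ranges, one per document
    query_span : (start, end) token index range of the query
    """
    q0, q1 = query_span
    query_rows = attn[q0:q1, :]  # attention emitted by query tokens
    return np.array([query_rows[:, s:e].sum() for (s, e) in doc_spans])

def calibrated_icr(attn_query, attn_content_free, doc_spans, query_span):
    """Calibrated ICR-style scores: subtract the scores produced by a
    content-free query to remove document-level bias, so ranking reflects
    the change in attention caused by the real query."""
    raw = icr_scores(attn_query, doc_spans, query_span)
    bias = icr_scores(attn_content_free, doc_spans, query_span)
    return raw - bias

# Toy example: 10 tokens, two documents, query occupies the last 4 positions.
doc_spans, query_span = [(0, 3), (3, 6)], (6, 10)
attn_q = np.full((10, 10), 0.01)
attn_q[6:10, 3:6] = 0.5          # real query attends strongly to document 2
attn_cf = np.full((10, 10), 0.01) # content-free query: flat attention
scores = calibrated_icr(attn_q, attn_cf, doc_spans, query_span)
ranking = np.argsort(-scores)     # document 2 (index 1) ranks first
```

Because ranking is a deterministic sort over these scores, every document receives exactly one position, which is how a well-formed ranking is guaranteed without parsing generated text. Only two forward passes are needed regardless of the number of documents: one with the real query and one with the content-free query.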
Shijie Chen
PhD Student, The Ohio State University
Natural Language Processing · Machine Learning
Bernal Jiménez Gutiérrez
The Ohio State University
Yu Su
The Ohio State University