🤖 AI Summary
Attention-based re-rankers derive relevance scores from Transformer attention weights, but many of a model's attention heads contribute only noise and redundancy, limiting retrieval effectiveness.
Method: We propose CoRe (Contrastive Re-ranking), a parameter-free framework that uses a contrastive scoring metric to quantify how discriminatively each attention head attends to relevant documents, enabling dynamic selection of high-value heads. Furthermore, we introduce a relative ranking criterion to identify the most useful heads, which are empirically concentrated in the middle layers, allowing the last 50% of layers to be pruned to accelerate inference.
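As a rough illustration, here is a minimal sketch of this style of contrastive head scoring. The function names, the use of a simple attention-mass difference as the contrastive score, and plain top-k selection are assumptions for exposition, not the paper's exact formulation.

```python
import torch

def contrastive_head_scores(attn, query_idx, rel_idx, irrel_idx):
    """Score every attention head by how much more attention mass the
    query places on a known-relevant document than on irrelevant ones.

    attn:      (layers, heads, seq, seq) attention weights from one forward pass
    query_idx: token positions of the query
    rel_idx:   token positions of the relevant document
    irrel_idx: token positions of the irrelevant documents
    returns:   (layers, heads) contrastive scores
    """
    q = attn[:, :, query_idx, :]                    # (L, H, |q|, seq)
    to_rel = q[..., rel_idx].sum(dim=(-2, -1))      # mass on relevant doc
    to_irrel = q[..., irrel_idx].sum(dim=(-2, -1))  # mass on irrelevant docs
    return to_rel - to_irrel                        # higher = more discriminative

def select_core_heads(scores, k):
    """Keep the top-k heads; per the paper this ends up well under 1% of all heads."""
    top = torch.topk(scores.flatten(), k).indices
    layers = top // scores.shape[1]
    heads = top % scores.shape[1]
    return list(zip(layers.tolist(), heads.tolist()))
```

In practice the scores would be averaged over a small labeled calibration set before selecting heads, so that the selection reflects consistent behavior rather than a single example.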
Contribution/Results: CoRe requires no fine-tuning, supports zero-shot and long-context large language models (LLMs), and operates at the list level. Evaluated on three mainstream LLMs, it achieves state-of-the-art re-ranking performance using fewer than 1% of attention heads, significantly outperforming strong baselines while drastically reducing computational overhead.
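To make the pruning step concrete, below is a minimal sketch assuming a Hugging Face-style decoder; the model name and the exact truncation point are placeholders. Because the selected heads sit in the middle layers, the forward pass can stop halfway and still expose the attention weights needed for scoring.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Drop the final 50% of decoder layers: they are never needed if all
# selected heads live in the first half of the network.
n_layers = model.config.num_hidden_layers
model.model.layers = model.model.layers[: n_layers // 2]
model.config.num_hidden_layers = n_layers // 2

inputs = tok("query ... candidate documents ...", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)
# out.attentions: one (batch, heads, seq, seq) tensor per kept layer,
# from which the head scores above can be read off.
```

Skipping the later layers saves both compute and the memory for their weights and activations, which is where the reported reductions in inference time and memory usage come from.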
📝 Abstract
The strong zero-shot and long-context capabilities of recent Large Language Models (LLMs) have paved the way for highly effective re-ranking systems. Attention-based re-rankers leverage attention weights from transformer heads to produce relevance scores, but not all heads are created equal: many contribute noise and redundancy, thus limiting performance. To address this, we introduce CoRe heads, a small set of retrieval heads identified via a contrastive scoring metric that explicitly rewards heads whose high attention correlates with relevant documents, while downplaying heads whose high attention correlates with irrelevant documents. This relative ranking criterion isolates the most discriminative heads for re-ranking and yields a state-of-the-art list-wise re-ranker. Extensive experiments with three LLMs show that aggregated signals from CoRe heads, constituting less than 1% of all heads, substantially improve re-ranking accuracy over strong baselines. We further find that CoRe heads are concentrated in the middle layers, and pruning the computation of the final 50% of model layers preserves accuracy while significantly reducing inference time and memory usage.
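For intuition on the list-wise step, a hypothetical sketch of scoring with a fixed set of CoRe heads follows; the aggregation rule (summing attention mass over the selected heads) is an assumption for illustration, not necessarily the paper's exact aggregation. Every candidate document receives one score from a single forward pass over the whole list.

```python
import torch

def rerank(attn, core_heads, query_idx, doc_spans):
    """Rank candidate documents by aggregated CoRe-head attention.

    attn:       (layers, heads, seq, seq) attention from one list-wise pass
    core_heads: [(layer, head), ...] previously selected CoRe heads
    query_idx:  token positions of the query
    doc_spans:  {doc_id: token positions of that document}
    returns:    doc_ids sorted by descending relevance score
    """
    scores = {}
    for doc_id, span in doc_spans.items():
        s = 0.0
        for layer, head in core_heads:
            # attention mass flowing from query tokens to this document
            s += attn[layer, head][query_idx][:, span].sum().item()
        scores[doc_id] = s
    return sorted(scores, key=scores.get, reverse=True)
```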