Where Relevance Emerges: A Layer-Wise Study of Internal Attention for Zero-Shot Re-Ranking

📅 2026-02-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing zero-shot reranking methods rely on generation- or likelihood-based scoring, which suffers from high inference latency and inconsistent results. This work systematically evaluates generative, likelihood-based, and internal attention mechanisms across multiple reranking frameworks and reveals, for the first time, that relevance signals derived from Transformer internal attention exhibit a universal "bell-shaped" distribution across layers. Building on this insight, the authors propose Selective-ICR, a layer selection strategy that requires no additional training. On the BRIGHT benchmark, Selective-ICR reduces inference latency by 30%–50% while enabling a 0.6B-parameter model to outperform current state-of-the-art generative approaches and an 8B-parameter model to match the performance of a 14B-parameter reinforcement learning-based reranker, demonstrating that small models can achieve both efficient and high-quality reranking by exploiting internal signals.

๐Ÿ“ Abstract
Zero-shot document re-ranking with Large Language Models (LLMs) has evolved from Pointwise methods to Listwise and Setwise approaches that optimize computational efficiency. Despite their success, these methods predominantly rely on generative scoring or output logits, which face bottlenecks in inference latency and result consistency. In-Context Re-ranking (ICR) has recently been proposed as an $O(1)$ alternative method. ICR extracts internal attention signals directly, avoiding the overhead of text generation. However, existing ICR methods simply aggregate signals across all layers; layer-wise contributions and their consistency across architectures have been left unexplored. Furthermore, no unified study has compared internal attention with traditional generative and likelihood-based mechanisms across diverse ranking frameworks under consistent conditions. In this paper, we conduct an orthogonal evaluation of generation, likelihood, and internal attention mechanisms across multiple ranking frameworks. We further identify a universal "bell-curve" distribution of relevance signals across transformer layers, which motivates the proposed Selective-ICR strategy that reduces inference latency by 30%-50% without compromising effectiveness. Finally, evaluation on the reasoning-intensive BRIGHT benchmark shows that precisely capturing high-quality in-context attention signals fundamentally reduces the need for model scaling and reinforcement learning: a zero-shot 8B model matches the performance of 14B reinforcement-learned re-rankers, while even a 0.6B model outperforms state-of-the-art generation-based approaches. These findings redefine the efficiency-effectiveness frontier for LLM-based re-ranking and highlight the latent potential of internal signals for complex reasoning ranking tasks. Our code and results are publicly available at https://github.com/ielab/Selective-ICR.
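The abstract's core mechanism — scoring candidate documents by the attention mass flowing from query tokens to each document's tokens, aggregated over a selected band of layers rather than all of them — can be sketched as follows. This is a minimal illustration of the ICR-style idea, not the authors' implementation: the function name, toy attention tensor, and chosen layer band are all assumptions for demonstration.

```python
import numpy as np

def icr_scores(attn, query_idx, doc_spans, layers=None):
    """Score documents by attention mass from query tokens to each
    document's token span, aggregated over the chosen layers.

    attn: array (num_layers, num_heads, seq, seq) of attention weights
          (rows = attending token, cols = attended token).
    query_idx: indices of the query tokens in the sequence.
    doc_spans: {doc_id: (start, end)} token spans per document.
    layers: layer indices to aggregate; None = all layers
            (plain ICR-style aggregation over every layer).
    """
    if layers is None:
        layers = range(attn.shape[0])
    # Average over heads, then collect attention from query rows.
    pooled = attn[list(layers)].mean(axis=1)          # (L_sel, seq, seq)
    q2all = pooled[:, query_idx, :].sum(axis=(0, 1))  # (seq,)
    return {d: float(q2all[s:e].sum()) for d, (s, e) in doc_spans.items()}

# Toy setup: 6 layers, 2 heads, 10 tokens; query = last two tokens,
# two candidate "documents" occupying tokens 0-4 and 5-7.
rng = np.random.default_rng(0)
attn = rng.random((6, 2, 10, 10))
attn /= attn.sum(axis=-1, keepdims=True)  # row-normalise like softmax
spans = {"docA": (0, 5), "docB": (5, 8)}

full = icr_scores(attn, [8, 9], spans)                      # all layers
band = icr_scores(attn, [8, 9], spans, layers=[2, 3, 4])    # middle band only
```

A Selective-ICR-like strategy corresponds to restricting `layers` to the band where the bell-shaped relevance signal peaks, which also lets inference stop early once those layers have been computed — the claimed source of the 30%–50% latency reduction.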
Problem

Research questions and friction points this paper is trying to address.

zero-shot re-ranking
internal attention
in-context learning
LLM efficiency
layer-wise analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Internal Attention
Zero-Shot Re-Ranking
Selective-ICR
Layer-wise Analysis
In-Context Re-ranking