Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing attention-based re-ranking methods either aggregate attention signals across all heads or rely on a heuristically chosen static subset. Neither approach adapts to the fact that the informative heads vary across queries, and naively combining heads can hurt ranking quality through redundancy or conflicting signals. To address this, the work proposes RouteHead, a query-aware dynamic routing mechanism for attention heads. A frozen large language model supplies a query embedding from its hidden states, each head is represented by a learnable embedding, and a lightweight router is trained on pseudo-labels obtained by offline search, with a sparsity regularizer. At inference, the router selects a subset of heads for each query, and relevance scores are computed from the attention signals of those heads only. Across multiple benchmarks and several LLM backbones, RouteHead consistently outperforms strong baselines in zero-shot re-ranking.
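The per-query selection step can be illustrated with a minimal sketch. It assumes the router has already produced one probability per head and that each head has already yielded an attention-derived relevance score per candidate document. The function name, the 0.5 threshold, the fallback to the single best head, and the mean aggregation are illustrative assumptions, not details confirmed by the paper.

```python
def rerank_with_routed_heads(head_doc_scores, head_probs, threshold=0.5):
    """Rank documents using only the heads the router selects for this query.

    head_doc_scores[h][d]: attention-derived relevance of document d from head h.
    head_probs[h]: router probability that head h is useful for this query.
    The threshold rule and mean aggregation are assumptions for illustration.
    """
    # Keep the heads whose router probability clears the threshold.
    selected = [h for h, p in enumerate(head_probs) if p >= threshold]
    if not selected:
        # Assumed fallback: use the single most probable head.
        selected = [max(range(len(head_probs)), key=lambda h: head_probs[h])]

    num_docs = len(head_doc_scores[0])
    # Average the per-head scores over the selected heads only.
    doc_scores = [
        sum(head_doc_scores[h][d] for h in selected) / len(selected)
        for d in range(num_docs)
    ]
    # Higher score means more relevant; return document indices best-first.
    ranking = sorted(range(num_docs), key=lambda d: doc_scores[d], reverse=True)
    return selected, doc_scores, ranking


# Toy example: 3 heads, 3 candidate documents.
scores = [
    [0.1, 0.9, 0.3],  # head 0
    [0.8, 0.2, 0.1],  # head 1 (disagrees with the others)
    [0.2, 0.7, 0.6],  # head 2
]
selected, doc_scores, ranking = rerank_with_routed_heads(scores, [0.9, 0.1, 0.7])
print(selected, ranking)
```

In this toy case the router drops head 1, whose signal conflicts with the other two, so the final ranking reflects only heads 0 and 2.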

📝 Abstract

Large Language Models (LLMs) have recently been explored as fine-grained zero-shot re-rankers by leveraging attention signals to estimate document relevance. However, existing methods either aggregate attention signals across all heads or rely on a statically selected subset identified by heuristic rules. Both strategies can be suboptimal because the informative heads can vary across queries or domains. Moreover, naively combining multiple heads can degrade performance due to redundancy or conflicting ranking signals. In this paper, we propose a query-dependent head selection method, RouteHead, for attention-based re-ranking with LLMs. Specifically, we learn a lightweight router that maps each query to an optimal head set, and relevance scores are computed by aggregating attention signals only from these heads. Since optimal query-to-head labels are unavailable, we first construct pseudo labels via an offline search. The router represents each head with a learnable embedding and represents each query using an embedding extracted from the hidden states of the frozen LLM. The router is then trained on the pseudo labels with a sparsity regularizer. Experiments on diverse benchmarks and multiple LLM backbones show that the proposed method consistently outperforms strong baselines.
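One way to picture the router and its training objective is the sketch below. The abstract does not specify the scoring function or the loss, so everything concrete here is an assumption: head probabilities are taken as a sigmoid over the dot product of the query embedding and each head embedding, the fit to the pseudo labels is a binary cross-entropy, and the sparsity regularizer is the mean predicted probability (an L1-style penalty). The names `router_probs`, `router_loss`, and `sparsity_weight` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def router_probs(query_emb, head_embs):
    """Assumed scoring rule: sigmoid of the query/head dot product, one value per head."""
    return [
        sigmoid(sum(q * e for q, e in zip(query_emb, head_emb)))
        for head_emb in head_embs
    ]


def router_loss(probs, pseudo_labels, sparsity_weight=0.01):
    """Assumed objective: binary cross-entropy against pseudo labels plus a sparsity term.

    pseudo_labels[h] is 1 if the offline search kept head h for this query, else 0.
    The sparsity term penalizes the mean probability, pushing toward fewer active heads.
    """
    eps = 1e-12  # guards against log(0)
    bce = -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for p, y in zip(probs, pseudo_labels)
    ) / len(probs)
    sparsity = sum(probs) / len(probs)
    return bce + sparsity_weight * sparsity


# Toy example: a 2-dimensional query embedding and 3 head embeddings.
query_emb = [1.0, 0.0]
head_embs = [[0.0, 0.0], [10.0, 0.0], [-10.0, 0.0]]
probs = router_probs(query_emb, head_embs)
print(probs)
print(router_loss(probs, [0, 1, 0]))
```

In a real setup the head embeddings (and any projection of the query embedding) would be updated by gradient descent on this loss while the LLM stays frozen; the sketch only evaluates the forward pass and the loss value.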

Problem

Research questions and friction points this paper is trying to address.

attention-based re-ranking

query-dependent head selection

large language models

relevance estimation

attention heads

Innovation

Methods, ideas, or system contributions that make the work stand out.

query-dependent routing

attention head selection

pseudo-label training