Language Model Re-rankers are Steered by Lexical Similarities

📅 2025-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work challenges the prevailing assumption that language model (LM) re-rankers are inherently superior to BM25 in retrieval-augmented generation (RAG). The authors systematically evaluate six LM re-rankers, including cross-encoders and ColBERTv2, on NQ, LitQA2, and the adversarial DRUID benchmark. Results show that LM re-rankers struggle to outperform BM25 on DRUID, with ranking decisions heavily driven by lexical surface similarity rather than semantic understanding. To formalize this bias, the authors propose a novel separation metric grounded in BM25 scores, which quantifies the implicit lexical dependence of LM re-rankers. The analysis further reveals that current evaluation benchmarks lack realistic semantic and adversarial challenges, and that existing optimization techniques improve performance mainly on simpler datasets such as NQ while failing on DRUID. These findings underscore the need for more semantically rich and adversarially robust benchmarks to advance reliable RAG systems.
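The separation metric is only described at a high level here. A minimal sketch of one plausible formulation, assuming we already have per-query BM25 scores for gold and distractor passages and using a d-prime-style standardized mean difference (the paper's exact formulation may differ; `separation_metric`, `gold_bm25`, and `distractor_bm25` are hypothetical names introduced for illustration):

```python
from math import sqrt

def separation_metric(gold_bm25, distractor_bm25):
    """Illustrative separation score for one query: how cleanly BM25
    scores alone separate gold passages from distractors.
    NOTE: hypothetical d-prime-style variant, not the paper's exact metric."""
    def mean(xs):
        return sum(xs) / len(xs)

    def var(xs, m):
        return sum((x - m) ** 2 for x in xs) / len(xs)

    mg, md = mean(gold_bm25), mean(distractor_bm25)
    pooled = sqrt((var(gold_bm25, mg) + var(distractor_bm25, md)) / 2)
    # Guard against zero variance (all scores identical on both sides).
    return (mg - md) / pooled if pooled > 0 else 0.0

# High separation: lexical overlap alone already distinguishes gold passages,
# so a re-ranker steered by lexical similarity looks deceptively good.
print(separation_metric([12.0, 10.5, 11.2], [3.1, 2.4, 4.0]))
```

On a dataset like DRUID, where gold and distractor passages have similar BM25 score distributions, such a score would be near zero, which is where the paper reports LM re-rankers making the most errors.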

📝 Abstract
Language model (LM) re-rankers are used to refine retrieval results for retrieval-augmented generation (RAG). They are more expensive than lexical matching methods like BM25 but assumed to better process semantic information. To understand whether LM re-rankers always live up to this assumption, we evaluate 6 different LM re-rankers on the NQ, LitQA2 and DRUID datasets. Our results show that LM re-rankers struggle to outperform a simple BM25 re-ranker on DRUID. Leveraging a novel separation metric based on BM25 scores, we explain and identify re-ranker errors stemming from lexical dissimilarities. We also investigate different methods to improve LM re-ranker performance and find these methods mainly useful for NQ. Taken together, our work identifies and explains weaknesses of LM re-rankers and points to the need for more adversarial and realistic datasets for their evaluation.
Problem

Research questions and friction points this paper is trying to address.

Evaluate LM re-rankers' semantic processing capabilities
Identify weaknesses in LM re-rankers due to lexical dissimilarities
Explore methods to enhance LM re-ranker performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic evaluation of six LM re-rankers against BM25 on NQ, LitQA2, and the adversarial DRUID dataset.
A novel BM25-based separation metric that identifies re-ranker errors caused by lexical dissimilarities.
Investigation of methods to improve LM re-ranker performance, which prove useful mainly on NQ.