On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

📅 2026-04-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of systematic evaluation of generalization and stability in large language model (LLM)-based dense retrievers. It presents the first systematic robustness study of state-of-the-art open-source LLM-based retrievers, combining linear mixed-effects modeling, query perturbations, adversarial attacks, and embedding geometry analysis. The authors find that geometric properties of the embedding space, such as angular uniformity, provide predictive signals for lexical stability, and that scaling model size generally improves robustness. Instruction-tuned models generally generalize well, whereas models optimized for complex reasoning pay a “specialization tax” and generalize poorly to broader contexts. LLM-based retrievers are more robust to typos and corpus poisoning than encoder-only baselines but remain vulnerable to semantic perturbations such as synonym substitution.

📝 Abstract

Decoder-only large language models (LLMs) are increasingly replacing BERT-style architectures as the backbone for dense retrieval, achieving substantial performance gains and broad adoption. However, the robustness of these LLM-based retrievers remains underexplored. In this paper, we present the first systematic study of the robustness of state-of-the-art open-source LLM-based dense retrievers from two complementary perspectives: generalizability and stability. For generalizability, we evaluate retrieval effectiveness across four benchmarks spanning 30 datasets, using linear mixed-effects models to estimate marginal mean performance and disentangle intrinsic model capability from dataset heterogeneity. Our analysis reveals that while instruction-tuned models generally excel, those optimized for complex reasoning often suffer a “specialization tax,” exhibiting limited generalizability in broader contexts. For stability, we assess model resilience against both unintentional query variations (e.g., paraphrasing, typos) and malicious adversarial attacks (e.g., corpus poisoning). We find that LLM-based retrievers show improved robustness against typos and corpus poisoning compared to encoder-only baselines, yet remain vulnerable to semantic perturbations like synonymizing. Further analysis shows that embedding geometry (e.g., angular uniformity) provides predictive signals for lexical stability and suggests that scaling model size generally improves robustness. These findings inform future robustness-aware retriever design and principled benchmarking. Our code is publicly available at https://github.com/liyongkang123/Robust_LLM_Retriever_Eval.
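
To make the stability perspective concrete, here is a minimal sketch of one way to measure how much a typo changes a retriever's top-k results. It is not the authors' evaluation code: the character-trigram scorer is a toy stand-in for a dense retriever, and the names `add_typo`, `top_k`, and `overlap_at_k`, the adjacent-swap typo model, and the overlap metric are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-gram counts; a toy stand-in for a dense embedding (assumption)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, corpus, k=3):
    """Indices of the k highest-scoring documents; ties broken by index."""
    q = char_ngrams(query)
    scored = [(cosine(q, char_ngrams(doc)), idx) for idx, doc in enumerate(corpus)]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in ranked[:k]]


def add_typo(query, rng):
    """Hypothetical typo model: swap one pair of adjacent characters."""
    if len(query) < 2:
        return query
    i = rng.randrange(len(query) - 1)
    chars = list(query)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def overlap_at_k(query, perturbed, corpus, k=3):
    """Fraction of the original top-k that survives the perturbation."""
    original = set(top_k(query, corpus, k))
    changed = set(top_k(perturbed, corpus, k))
    return len(original & changed) / k


corpus = [
    "dense retrieval with large language models",
    "corpus poisoning attacks on retrievers",
    "spelling errors in search queries",
    "embedding geometry and angular uniformity",
    "linear mixed-effects models for benchmarks",
]
rng = random.Random(0)
query = "robust dense retrieval"
noisy = add_typo(query, rng)
print(noisy, overlap_at_k(query, noisy, corpus))
```

An overlap of 1.0 means the typo left the top-k set unchanged. Averaging this score over many queries and perturbation types (typos, paraphrases, synonym substitutions) would give a rough per-model stability estimate that can be compared across retrievers.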

Problem

Research questions and friction points this paper is trying to address.

robustness

dense retrieval

large language models

generalizability

stability

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based dense retrieval

robustness evaluation

generalizability