🤖 AI Summary
This paper identifies fundamental limitations of large language model (LLM)-driven query expansion (QE) in two prevalent failure scenarios: (1) erroneous expansions caused by gaps in the LLM's knowledge, and (2) bias-prone refinements that narrow retrieval scope when queries are semantically ambiguous. Method: To systematically dissect these failures, the authors first formally distinguish and empirically validate *knowledge insufficiency* and *ambiguity-induced bias* as orthogonal root causes; they then propose a QE evaluation framework that jointly measures knowledge coverage and ambiguity robustness, validated through controlled experiments across multiple benchmarks using both sparse (BM25) and dense (ColBERT) retrieval models. Contribution/Results: Quantitative analysis reveals that for knowledge-poor or highly ambiguous queries, NDCG@10 degrades by 18.7% on average, providing critical failure diagnostics and actionable guidance for improving LLM-augmented retrieval.
📝 Abstract
Query expansion (QE) enhances retrieval by incorporating relevant terms, with large language models (LLMs) offering an effective alternative to traditional rule-based and statistical methods. However, LLM-based QE suffers from a fundamental limitation: it often fails to generate relevant knowledge, degrading search performance. Prior studies have focused on hallucination, yet its underlying cause, LLM knowledge deficiencies, remains underexplored. This paper systematically examines two failure cases in LLM-based QE: (1) when the LLM lacks knowledge of the query topic, leading to incorrect expansions, and (2) when the query is ambiguous, causing biased refinements that narrow search coverage. We conduct controlled experiments across multiple datasets, evaluating the effects of LLM knowledge and query ambiguity on retrieval performance using sparse and dense retrieval models. Our results reveal that LLM-based QE can significantly degrade retrieval effectiveness when knowledge in the LLM is insufficient or query ambiguity is high. We introduce a framework for evaluating QE under these conditions, providing insights into the limitations of LLM-based retrieval augmentation.
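The degradation reported above is measured with NDCG@10. As a minimal illustration of how such a drop is computed (the relevance lists below are hypothetical, not from the paper), the metric discounts graded relevance by rank position and normalizes by the ideal ranking:

```python
import math

def dcg_at_k(rels, k=10):
    # Discounted cumulative gain over the top-k graded relevance labels
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(rels, k=10):
    # Normalize by the DCG of the ideal (descending) ordering
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance of the top-10 results for one query,
# before and after an LLM expansion that narrowed search coverage
baseline = [3, 2, 2, 1, 0, 1, 0, 0, 0, 0]
expanded = [2, 1, 0, 1, 0, 0, 0, 0, 0, 0]
relative_drop = 1 - ndcg_at_k(expanded) / ndcg_at_k(baseline)
```

Averaging such per-query relative drops over a benchmark yields the kind of aggregate degradation figure reported in the summary.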