🤖 AI Summary
This study addresses the lack of reproducibility and standardized evaluation among existing LLM-based query reformulation methods, a problem that stems from heterogeneous experimental setups. The authors systematically reimplement and compare ten representative approaches within a unified, strictly controlled framework covering two LLM families (each at two parameter scales), three retrieval paradigms (lexical, learned sparse, and dense), and nine benchmark datasets drawn from TREC Deep Learning and BEIR, all evaluated with consistent prompting templates and protocols. Their analysis shows that reformulation gains depend strongly on the retrieval paradigm: gains observed under lexical retrieval do not consistently transfer to neural retrievers, and larger LLMs do not uniformly yield better downstream performance. All prompts, configurations, evaluation scripts, and run files are released through QueryGym, an open-source reformulation toolkit with a public leaderboard, to enable transparent replication and fair comparison.
📝 Abstract
Large Language Models (LLMs) are now widely used for query reformulation and expansion in Information Retrieval, with many studies reporting substantial effectiveness gains. However, these results are typically obtained under heterogeneous experimental conditions, making it difficult to assess which findings are reproducible and which depend on specific implementation choices. In this work, we present a systematic reproducibility and comparative study of ten representative LLM-based query reformulation methods under a unified and strictly controlled experimental framework. We evaluate methods across two architectural LLM families at two parameter scales, three retrieval paradigms (lexical, learned sparse, and dense), and nine benchmark datasets spanning TREC Deep Learning and BEIR. Our results show that reformulation gains are strongly conditioned on the retrieval paradigm, that improvements observed under lexical retrieval do not consistently transfer to neural retrievers, and that larger LLMs do not uniformly yield better downstream performance. These findings clarify the stability and limits of reported gains in prior work. To enable transparent replication and ongoing comparison, we release all prompts, configurations, evaluation scripts, and run files through QueryGym, an open-source reformulation toolkit with a public leaderboard (https://leaderboard.querygym.com).
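To give a sense of the size of the evaluation described above, the following minimal sketch enumerates the experimental grid. It assumes every method is run in every combination of LLM family, scale, retriever, and dataset (a full cross product), which the abstract does not state explicitly. All method, family, scale, and dataset labels are placeholders, since the abstract gives only their counts; only the three retrieval paradigm names come from the text. This is an illustration, not QueryGym's actual configuration format.

```python
from itertools import product

# Placeholder labels: the abstract reports counts, not names.
METHODS = [f"method_{i:02d}" for i in range(1, 11)]   # ten reformulation methods
LLM_FAMILIES = ["family_a", "family_b"]               # two architectural families
LLM_SCALES = ["small", "large"]                       # two parameter scales per family
RETRIEVERS = ["lexical", "learned_sparse", "dense"]   # paradigms named in the abstract
DATASETS = [f"dataset_{i}" for i in range(1, 10)]     # nine benchmark datasets


def build_grid():
    """List every experimental cell, assuming a full cross product of all factors."""
    return [
        {
            "method": method,
            "llm_family": family,
            "llm_scale": scale,
            "retriever": retriever,
            "dataset": dataset,
        }
        for method, family, scale, retriever, dataset in product(
            METHODS, LLM_FAMILIES, LLM_SCALES, RETRIEVERS, DATASETS
        )
    ]


if __name__ == "__main__":
    grid = build_grid()
    # 10 methods x 2 families x 2 scales x 3 retrievers x 9 datasets
    print(len(grid))
```

Under the full cross product assumption this yields 1080 reformulated runs, before counting any unreformulated baselines. Grouping the cells by the `retriever` key is the natural way to examine the paper's central claim that gains under lexical retrieval do not consistently carry over to learned sparse or dense retrievers.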