🤖 AI Summary
This work investigates whether current neural retrieval models genuinely adhere to the logical constraints in set-compositional queries (those involving conjunction, disjunction, and exclusion) or instead rely on semantic shortcuts. To this end, the authors introduce LIMIT+, a controlled benchmark that decouples pretraining knowledge from constraint-based reasoning, and use it alongside QUEST to evaluate BM25, dense and sparse retrieval models, and reasoning-oriented approaches such as ReasonIR and Search-R1. Experiments show that while the best neural retrievers achieve Recall@100 above 0.41 on QUEST, the strongest QUEST method collapses to below 0.02 on LIMIT+, in stark contrast to BM25, which reaches roughly 0.96. This discrepancy points to a reliance on semantic priors rather than true constraint satisfaction, and positions LIMIT+ as a controlled testbed for future research in constrained retrieval.
📝 Abstract
Complex information needs may involve set-compositional queries using conjunction, disjunction, and exclusion, yet it remains unclear whether current retrieval paradigms genuinely satisfy such constraints or exploit "semantic shortcuts". We conduct a reproducibility study to benchmark major retrieval families and reasoning-targeted methods on QUEST and QUEST+Variants, and introduce LIMIT+, a controlled benchmark where relevance depends on arbitrary attribute predicates and constraint satisfaction, and less on pretrained knowledge. Our findings show that (i) on QUEST, the best neural retrievers are more than twice as effective as BM25 (Recall@100 > 0.41 vs. 0.20), but reasoning-targeted methods such as ReasonIR and Search-R1 do not uniformly outperform general-purpose retrievers; (ii) on LIMIT+, these gains fail to transfer: the strongest QUEST method collapses from Recall@100 ≈ 0.42 to below 0.02, while classic lexical retrieval rises to ~0.96; and (iii) stratifying by compositional depth reveals consistent degradation across all methods, with algebraic sparse and lexical methods remaining more stable while dense approaches collapse. We release code and LIMIT+ data generation scripts to support future reproducibility and controlled evaluation.
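
To make the evaluation setup concrete, the following is a minimal sketch of how a set-compositional query over arbitrary attribute predicates determines the relevant set, and how Recall@k is then computed for a ranking. It is an illustration only and not the authors' released code: the toy corpus `DOCS`, the helper names (`has`, `and_`, `or_`, `not_`, `relevant_set`, `recall_at_k`), and the example ranking are all hypothetical.

```python
from typing import Callable, Dict, List, Set

Predicate = Callable[[Set[str]], bool]

# Hypothetical toy corpus: each document is a set of arbitrary attributes,
# so relevance cannot be guessed from background (pretrained) knowledge.
DOCS: Dict[str, Set[str]] = {
    "d1": {"A", "B"},
    "d2": {"A", "C"},
    "d3": {"B"},
    "d4": {"A", "B", "C"},
}


def has(attr: str) -> Predicate:
    """Atomic predicate: the document carries attribute `attr`."""
    return lambda attrs: attr in attrs


def and_(p: Predicate, q: Predicate) -> Predicate:
    """Conjunction (set intersection)."""
    return lambda attrs: p(attrs) and q(attrs)


def or_(p: Predicate, q: Predicate) -> Predicate:
    """Disjunction (set union)."""
    return lambda attrs: p(attrs) or q(attrs)


def not_(p: Predicate) -> Predicate:
    """Exclusion (set complement)."""
    return lambda attrs: not p(attrs)


def relevant_set(query: Predicate, docs: Dict[str, Set[str]]) -> Set[str]:
    """Documents whose attributes satisfy every constraint in the query."""
    return {doc_id for doc_id, attrs in docs.items() if query(attrs)}


def recall_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)


# Query: (A AND B) AND NOT C
query = and_(and_(has("A"), has("B")), not_(has("C")))
gold = relevant_set(query, DOCS)

# A hypothetical "shortcut" ranking that rewards attribute overlap and
# ignores the exclusion: d4 mentions A, B, and C, so it is ranked first.
shortcut_ranking = ["d4", "d1", "d2", "d3"]
print(gold, recall_at_k(shortcut_ranking, gold, k=1))
```

In this toy case only `d1` satisfies the query, so a ranker that scores by attribute overlap alone places the excluded document `d4` first and scores zero recall at k=1, which is the kind of failure a controlled benchmark of this style is meant to expose.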