Reproducing NevIR: Negation in Neural Information Retrieval

📅 2025-02-19
🤖 AI Summary
This work addresses the long-overlooked challenge of negation understanding in neural information retrieval (IR), where language models (LMs) exhibit significant deficiencies. We propose a systematic evaluation framework and conduct a comprehensive assessment of mainstream IR models’ ranking performance on negated queries, using the NevIR and ExcluIR benchmarks. Our findings reveal that most models perform near-randomly; although listwise large language model (LLM) re-rankers achieve the best and most robust performance across both negation tasks, they still substantially underperform human-level capability. Moreover, distributional discrepancies between ExcluIR and NevIR hinder cross-dataset generalization. This study provides the first empirical evidence of the relative advantages—and fundamental limitations—of listwise LLM re-rankers in negation-aware retrieval, offering critical insights and concrete directions for advancing negation-sensitive IR modeling.

📝 Abstract
Negation is a fundamental aspect of human communication, yet it remains a challenge for Language Models (LMs) in Information Retrieval (IR). Despite the heavy reliance of modern neural IR systems on LMs, little attention has been given to their handling of negation. In this study, we reproduce and extend the findings of NevIR, a benchmark study that revealed most IR models perform at or below the level of random ranking when dealing with negation. We replicate NevIR's original experiments and evaluate newly developed state-of-the-art IR models. Our findings show that a recently emerging category - listwise Large Language Model (LLM) rerankers - outperforms other models but still underperforms human performance. Additionally, we leverage ExcluIR, a benchmark dataset designed for exclusionary queries with extensive negation, to assess the generalizability of negation understanding. Our findings suggest that fine-tuning on one dataset does not reliably improve performance on the other, indicating notable differences in their data distributions. Furthermore, we observe that only cross-encoders and listwise LLM rerankers achieve reasonable performance across both negation tasks.
Problem

Research questions and friction points this paper is trying to address.

Challenges in handling negation in IR
Performance gap between models and humans
Limited generalizability of negation understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Listwise LLM rerankers enhance negation handling
ExcluIR assesses negation understanding generalizability
Cross-encoders improve performance across negation tasks
Coen van Elsen
University of Amsterdam, Amsterdam, Noord-Holland, The Netherlands
Francien Barkhof
University of Amsterdam, Amsterdam, Noord-Holland, The Netherlands
Thijmen Nijdam
University of Amsterdam, Amsterdam, Noord-Holland, The Netherlands
Simon Lupart
PhD Student in Conversational AI at the University of Amsterdam (IRLab)
Information Retrieval · Conversational AI · Large Language Models · Reinforcement Learning
Mohammad Alliannejadi
University of Amsterdam, Amsterdam, Noord-Holland, The Netherlands