🤖 AI Summary
This work examines the implicit reference determinacy (RD) assumption in natural language inference (NLI), under which the premise and hypothesis are assumed to refer to the same context. The assumption breaks down in real-world tasks such as fact verification, where the two may refer to different contexts. To expose the problem, the authors introduce RefNLI, a diagnostic benchmark for reference ambiguity whose premises are retrieved from Wikipedia and therefore do not necessarily share the hypothesis's context. On RefNLI, both fine-tuned NLI models and few-shot prompted large language models (LLMs) fail to recognize context mismatch, leading to over 80% false contradiction and over 50% entailment predictions. The authors also find that reference ambiguity can partly explain inherent human disagreement in NLI, and they discuss how the RD assumption shapes NLI dataset creation.
📝 Abstract
We revisit the reference determinacy (RD) assumption in natural language inference (NLI), i.e., the assumption that the premise and hypothesis refer to the same context when human raters annotate a label. While RD is a practical assumption for constructing a new NLI dataset, we observe that current NLI models, which are typically trained solely on premise-hypothesis pairs created under the RD assumption, fail in downstream applications such as fact verification, where the input premise and hypothesis may refer to different contexts. To highlight the impact of this phenomenon in real-world use cases, we introduce RefNLI, a diagnostic benchmark for identifying reference ambiguity in NLI examples. In RefNLI, the premise is retrieved from a knowledge source (i.e., Wikipedia) and does not necessarily refer to the same context as the hypothesis. With RefNLI, we demonstrate that fine-tuned NLI models and few-shot prompted LLMs both fail to recognize context mismatch, leading to over 80% false contradiction and over 50% entailment predictions. We find that reference ambiguity in NLI examples can in part explain the inherent human disagreements in NLI, and we provide insight into how the RD assumption impacts the NLI dataset creation process.
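The abstract does not spell out how the "false contradiction" and entailment figures are computed. As a minimal sketch of one plausible reading, the snippet below reports, for each label, the share of a model's predictions of that label that disagree with the gold label. The function name `false_prediction_rates`, the label strings, and the toy data are all illustrative assumptions; none of them come from RefNLI or the paper.

```python
from collections import Counter

# Assumed three-way NLI label set (hypothetical strings, not taken from RefNLI).
LABELS = ("entailment", "neutral", "contradiction")


def false_prediction_rates(gold, predicted):
    """For each label, return the fraction of the model's predictions of that
    label whose gold label is different. Returns None for a label the model
    never predicted, since the rate is undefined in that case."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # How often the model predicted each label.
    predicted_counts = Counter(predicted)
    # How often each predicted label was wrong.
    wrong_counts = Counter(p for g, p in zip(gold, predicted) if g != p)

    rates = {}
    for label in LABELS:
        total = predicted_counts[label]
        rates[label] = wrong_counts[label] / total if total else None
    return rates


# Toy data, invented for illustration: "neutral" stands in for examples where
# the retrieved premise refers to a different context than the hypothesis.
gold = ["neutral", "neutral", "contradiction", "entailment", "neutral"]
predicted = ["contradiction", "contradiction", "contradiction",
             "entailment", "entailment"]

rates = false_prediction_rates(gold, predicted)
print(rates)
```

Under this reading, a high rate for "contradiction" means that most of the model's contradiction predictions fall on examples that human raters did not label as contradictions, which is the kind of failure the abstract attributes to unrecognized context mismatch.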