🤖 AI Summary
This work identifies a critical weakness in natural language inference (NLI) models: their inability to reliably identify presuppositions and perform fine-grained pragmatic reasoning in conditional sentences. To address this gap, we introduce CONFER, the first benchmark dataset specifically designed for evaluating presupposition handling and pragmatic inference in conditionals. We systematically assess four major NLI architectures alongside large language models (LLMs), including GPT-4o, LLaMA, Gemma, and DeepSeek-R1, under zero-shot and few-shot prompting settings. Results show that standard NLI models consistently fail on conditional presupposition tasks, and conventional fine-tuning yields minimal improvement; while LLMs demonstrate stronger contextual awareness, they still exhibit substantial limitations. This study is the first to empirically characterize and quantify this capability gap, providing a reproducible evaluation framework and rigorous empirical evidence, thereby establishing foundational resources for advancing joint semantic-pragmatic modeling and targeted model improvement.
📝 Abstract
Natural Language Inference (NLI) is the task of determining whether a sentence pair represents entailment, contradiction, or a neutral relationship. While NLI models perform well on many inference tasks, their ability to handle fine-grained pragmatic inferences, particularly presupposition in conditionals, remains underexplored. In this study, we introduce CONFER, a novel dataset designed to evaluate how NLI models process inference in conditional sentences. We assess the performance of four NLI models, including two pre-trained models, to examine their generalization to conditional reasoning. Additionally, we evaluate Large Language Models (LLMs), including GPT-4o, LLaMA, Gemma, and DeepSeek-R1, in zero-shot and few-shot prompting settings to analyze their ability to infer presuppositions with and without prior context. Our findings indicate that NLI models struggle with presuppositional reasoning in conditionals, and fine-tuning on existing NLI datasets does not necessarily improve their performance.
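As a concrete illustration of the two prompting settings described above, the sketch below builds zero-shot and few-shot prompts for the three-way NLI labeling task. The prompt template, label names, and example pairs here are illustrative assumptions for exposition, not the paper's actual prompts or data.

```python
# Hypothetical sketch of zero-shot vs. few-shot NLI prompting.
# The template and examples below are illustrative, not from CONFER.
LABELS = ("entailment", "contradiction", "neutral")

def build_nli_prompt(premise, hypothesis, examples=()):
    """Build a prompt asking a model to label a premise/hypothesis pair.

    `examples` is an optional sequence of (premise, hypothesis, label)
    triples; supplying them turns the zero-shot prompt into few-shot.
    """
    lines = [f"Label each pair as one of: {', '.join(LABELS)}."]
    for p, h, label in examples:  # few-shot demonstrations, if any
        lines += [f"Premise: {p}", f"Hypothesis: {h}", f"Label: {label}"]
    lines += [f"Premise: {premise}", f"Hypothesis: {hypothesis}", "Label:"]
    return "\n".join(lines)

# Zero-shot: no demonstrations, only the target pair.
zero_shot = build_nli_prompt(
    "If John stopped smoking, his health improved.",
    "John used to smoke.",
)

# Few-shot: labeled demonstrations precede the target pair.
few_shot = build_nli_prompt(
    "If John stopped smoking, his health improved.",
    "John used to smoke.",
    examples=[("A dog barks loudly.", "An animal makes noise.", "entailment")],
)
```

The target pair exemplifies a presupposition trigger ("stopped") embedded in a conditional antecedent, the phenomenon the dataset probes.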