Synthetic Dataset for Evaluating Complex Compositional Knowledge for Natural Language Inference

📅 2023-07-11
🏛️ NLRSE
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the limited capacity of neural natural language inference (NLI) models to reason about logical compositionality—particularly quantifiers, negation, and their nesting. To this end, the authors introduce SICCK, a controlled synthetic benchmark explicitly designed to evaluate compositional reasoning over quantification and negation in natural logic. Built upon the SICK dataset, SICCK employs a syntax–semantics modification framework to systematically inject universal/existential quantifiers and sentential or constituent-level negation, yielding 1,304 sentence pairs with precisely specified logical structures and human-verified entailment labels. Experiments reveal that state-of-the-art NLI models perform poorly on negation–quantifier combinations under zero-shot evaluation, and even after fine-tuning they continue to struggle with negation and quantifier modifiers, exposing a deficit in compositional generalization. SICCK thus provides an interpretable, controllable evaluation framework for complex compositional semantics in natural logic, advancing the development of NLI models with formal semantic sensitivity.
📝 Abstract
We introduce a synthetic dataset called Sentences Involving Complex Compositional Knowledge (SICCK) and a novel analysis that investigates how well Natural Language Inference (NLI) models understand compositionality in logic. We produce 1,304 sentence pairs by modifying 15 examples from the SICK dataset (Marelli et al., 2014). To this end, we modify the original texts using a set of phrase modifiers that correspond to universal quantifiers, existential quantifiers, negation, and other concept modifiers in Natural Logic (NL) (MacCartney, 2009). We use these phrases to modify the subject, verb, and object parts of the premise and hypothesis. Lastly, we annotate these modified texts with the corresponding entailment labels following NL rules. We conduct a preliminary verification of how well the change in structural and semantic composition is captured by neural NLI models, in both zero-shot and fine-tuned scenarios. We found that the performance of NLI models under the zero-shot setting is poor, especially for modified sentences with negation and existential quantifiers. After fine-tuning on this dataset, we observe that models continue to perform poorly on negation, existential, and universal modifiers.
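The modification scheme described in the abstract—injecting quantifier and negation modifiers into the subject and verb positions of premise and hypothesis, then pairing the results—can be sketched as follows. The modifier inventory and example sentence here are illustrative assumptions, not the paper's exact lists.

```python
# Illustrative sketch of SICCK-style pair generation: inject quantifier and
# negation modifiers into constituent slots, then enumerate premise/hypothesis
# pairs. The modifier lists and template are assumptions for demonstration.
from itertools import product

SUBJECT_MODIFIERS = ["every", "some", "no"]   # universal / existential / negated quantifier
VERB_FORMS = ["chases", "does not chase"]     # affirmative vs. sentential negation

def generate_sentences(subject, obj):
    """All combinations of subject modifier and verb polarity."""
    return [
        f"{q} {subject} {v} {obj}"
        for q, v in product(SUBJECT_MODIFIERS, VERB_FORMS)
    ]

def generate_pairs(subject, obj):
    """Pair every modified sentence with every other as (premise, hypothesis)."""
    sents = generate_sentences(subject, obj)
    return [(p, h) for p in sents for h in sents if p != h]

pairs = generate_pairs("dog", "a ball")
print(len(pairs))  # 6 sentences -> 30 ordered premise/hypothesis pairs
```

In the paper the modifiers are also applied to the object constituent and the pairs are built from 15 SICK examples, which is how the full set reaches 1,304 pairs; the sketch above only shows the combinatorial structure.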
Problem

Research questions and friction points this paper is trying to address.

Evaluating NLI models' understanding of compositional logic
Assessing model performance on quantifiers and negation
Testing zero-shot and fine-tuned compositional reasoning capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created synthetic dataset using logical modifiers
Modified sentence structures with quantifiers and negation
Evaluated NLI models on compositional reasoning tasks
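The natural-logic intuition behind the entailment labels can be illustrated with quantifier monotonicity: whether replacing a noun with a more or less specific one preserves entailment depends on the quantifier's monotonicity in that position. The toy rule table below is our simplification for illustration, not the paper's actual labeling procedure.

```python
# Toy sketch of the monotonicity reasoning behind natural-logic entailment
# labels. Only the subject (restrictor) position is modeled; this is an
# illustrative simplification, not the paper's labeling rules.

# Monotonicity of the subject position for each quantifier:
RESTRICTOR_MONOTONICITY = {
    "every": "down",  # "every animal runs" entails "every dog runs"
    "some": "up",     # "some dog runs" entails "some animal runs"
    "no": "down",     # "no animal runs" entails "no dog runs"
}

def label_on_substitution(quantifier, direction):
    """
    Label the pair obtained by replacing the subject noun with a hyponym
    ("specialize", e.g. animal -> dog) or a hypernym ("generalize",
    e.g. dog -> animal). Returns "entailment" when the quantifier's
    monotonicity licenses the substitution, else "neutral".
    """
    mono = RESTRICTOR_MONOTONICITY[quantifier]
    licensed = (mono == "down" and direction == "specialize") or \
               (mono == "up" and direction == "generalize")
    return "entailment" if licensed else "neutral"

print(label_on_substitution("every", "specialize"))  # entailment
print(label_on_substitution("some", "specialize"))   # neutral
```

Negation and nested modifiers flip or compose these monotonicity marks, which is precisely the interaction the paper finds NLI models handle poorly even after fine-tuning.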