🤖 AI Summary
Existing benchmarks treat negation as a peripheral phenomenon in tasks such as natural language inference, leaving no dedicated evaluation of sentence-level negation understanding. Method: We introduce Thunder-NUBench, a structured benchmark explicitly designed to assess large language models' (LLMs) sentence-level negation comprehension. It covers semantic variants including standard negation, local negation, contradiction, and paraphrase, and employs human-annotated sentence pairs with multiple-choice discrimination tasks for fine-grained, semantically grounded evaluation. Contribution/Results: Thunder-NUBench is the first framework to isolate negation understanding as a primary objective rather than a byproduct, and it introduces structured contrast against negative exemplars. It provides a high-quality, reproducible, and interpretable quantitative evaluation framework. Experiments reveal systematic failures of mainstream LLMs in negation reasoning, establishing Thunder-NUBench as a diagnostic and improvement tool for negation-aware model development.
📝 Abstract
Negation is a fundamental linguistic phenomenon that poses persistent challenges for Large Language Models (LLMs), particularly in tasks requiring deep semantic understanding. Existing benchmarks often treat negation as a side case within broader tasks like natural language inference, resulting in a lack of benchmarks that exclusively target negation understanding. In this work, we introduce Thunder-NUBench, a novel benchmark explicitly designed to assess sentence-level negation understanding in LLMs. Thunder-NUBench goes beyond surface-level cue detection by contrasting standard negation with structurally diverse alternatives such as local negation, contradiction, and paraphrase. The benchmark consists of manually curated sentence-negation pairs and a multiple-choice dataset that enables in-depth evaluation of models' negation understanding.
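To make the multiple-choice discrimination setup concrete, here is a minimal sketch of how such a benchmark item might be represented and scored. The field names, example sentences, and the baseline predictor are illustrative assumptions, not the released Thunder-NUBench schema or data.

```python
# Hypothetical sketch of a Thunder-NUBench-style item: given a sentence,
# the model must pick the true (standard) negation from distractors that
# include local negation, contradiction, and paraphrase.
from dataclasses import dataclass

@dataclass
class NegationItem:
    sentence: str            # original affirmative sentence
    choices: dict[str, str]  # alternative type -> candidate sentence
    answer: str              # key of the correct standard negation

# Illustrative example item (invented, not from the actual dataset).
item = NegationItem(
    sentence="The committee approved the proposal.",
    choices={
        "standard_negation": "The committee did not approve the proposal.",
        "local_negation": "The committee approved the proposal, but not unanimously.",
        "contradiction": "The committee rejected the proposal.",
        "paraphrase": "The proposal was approved by the committee.",
    },
    answer="standard_negation",
)

def evaluate(predict, items):
    """Accuracy of a predictor that maps (sentence, choices) to a choice key."""
    correct = sum(predict(it.sentence, it.choices) == it.answer for it in items)
    return correct / len(items)

# A trivial baseline that always guesses "contradiction" scores 0.0 here,
# showing how the distractors penalize shallow negation-cue heuristics.
baseline = lambda sentence, choices: "contradiction"
print(evaluate(baseline, [item]))  # 0.0
```

The design point this illustrates is that contradictions and paraphrases act as hard negatives: a model that merely spots negation cues such as "not" (which also appears in the local-negation distractor) cannot reliably separate the standard negation from the alternatives.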