🤖 AI Summary
This work investigates the logical consistency of large language models (LLMs) in natural language inference (NLI) and defeasible NLI when confronted with atomic-level hypothesis decomposition. Methodologically, we propose decomposing composite hypotheses into semantically atomic propositions to construct fine-grained sub-problems, and introduce the first defeasible reasoning attribution framework grounded in critical atomic sub-problems. We define a novel metric—“inferential consistency”—to quantify the stability of a model’s judgments about the same factual claim across diverse contexts. Experiments reveal pervasive logical inconsistency in LLMs’ atomic-level NLI and defeasible inference; identify key atomic sub-problems that dominate final label predictions; and demonstrate that our metric effectively discriminates model robustness in reasoning, offering a new lens for evaluating deep reasoning capabilities and dataset diversity.
📝 Abstract
Decomposition of text into atomic propositions is a flexible framework allowing for closer inspection of input and output text. We use atomic decomposition of hypotheses in two natural language reasoning tasks, traditional NLI and defeasible NLI, to form atomic sub-problems, or granular inferences that models must weigh when solving the overall problem. These atomic sub-problems serve as a tool to further understand the structure of both NLI and defeasible reasoning, probe a model's consistency and understanding of different inferences, and measure the diversity of examples in benchmark datasets. Our results indicate that LLMs still struggle with logical consistency on atomic NLI and defeasible NLI sub-problems. Lastly, we identify critical atomic sub-problems of defeasible NLI examples, or those that most contribute to the overall label, and propose a method to measure the inferential consistency of a model, a metric designed to capture the degree to which a model makes consistently correct or incorrect predictions about the same fact under different contexts.
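The two core ideas in the abstract, forming atomic sub-problems and scoring inferential consistency, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, record format, and the rule "a fact is consistent when the model is uniformly correct or uniformly incorrect across all contexts" are assumptions made here; in the actual work the atomic decomposition would be produced by an LLM or parser, not hand-written.

```python
from collections import defaultdict

def atomic_subproblems(premise, atoms):
    """Pair a shared premise with each atomic proposition decomposed from a
    composite hypothesis, yielding one fine-grained NLI sub-problem per atom.
    (Illustrative interface; the decomposition itself comes from a model.)"""
    return [{"premise": premise, "hypothesis": atom} for atom in atoms]

def inferential_consistency(records):
    """Fraction of atomic facts on which a model is uniformly correct or
    uniformly incorrect across every context the fact appears in.
    `records` is an iterable of (fact_id, context_id, correct) triples.
    (One plausible reading of the metric, assumed for this sketch.)"""
    by_fact = defaultdict(list)
    for fact_id, _context, correct in records:
        by_fact[fact_id].append(correct)
    consistent = sum(1 for outcomes in by_fact.values()
                     if len(set(outcomes)) == 1)
    return consistent / len(by_fact)

# Toy example: a composite hypothesis split into three atoms.
premise = "A man in a red jacket is jogging along the beach at sunrise."
atoms = ["A man is exercising.", "The man is outdoors.", "It is morning."]
subproblems = atomic_subproblems(premise, atoms)  # 3 fine-grained NLI pairs

# Toy example: one fact judged consistently, one flipping with context.
records = [
    ("penguins are birds", "ctx_a", True),
    ("penguins are birds", "ctx_b", True),   # same verdict in both contexts
    ("penguins can fly",   "ctx_a", False),
    ("penguins can fly",   "ctx_b", True),   # verdict flips with context
]
print(inferential_consistency(records))  # -> 0.5
```

Under this reading, a model with high inferential consistency may still be wrong about a fact, but it is wrong the same way everywhere; inconsistency signals that context, rather than the fact itself, is driving the prediction.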