🤖 AI Summary
This work investigates how computational complexity (spanning the P, NP, and PSPACE classes) and syntactic structure affect Transformer models' logical reasoning on Natural Language Satisfiability (NL-SAT). We introduce the first NL-SAT benchmark covering multiple complexity classes, constructed via formal semantic modeling and controllable instance generation, yielding a multidimensional evaluation suite in which syntax and complexity vary independently. We further propose a novel evaluation protocol based on reasoning-path consistency, moving beyond conventional accuracy metrics. Empirical results show a marked degradation in performance as problem complexity increases, and significant susceptibility to superficial syntactic variations, revealing fundamental limitations in symbolic reasoning. Our study provides a theory-grounded benchmark and methodological framework for evaluating natural language inference, and exposes intrinsic bottlenecks of current large language models on formal reasoning tasks.
📝 Abstract
Efforts to apply transformer-based language models (TLMs) to the problem of reasoning in natural language have enjoyed ever-increasing success in recent years. The most fundamental task in this area, to which nearly all others can be reduced, is that of determining satisfiability. However, from a logical point of view, satisfiability problems vary along several dimensions, which may affect TLMs' ability to learn how to solve them. Problem instances of satisfiability in natural language can belong to different computational complexity classes depending on the language fragment in which they are expressed. Although prior research has explored the problem of natural language satisfiability, this point has not been adequately discussed. Hence, we investigate how problem instances from different computational complexity classes and with different grammatical constructs affect TLMs' ability to learn rules of inference. Furthermore, to evaluate TLMs faithfully, we conduct an empirical study to explore the distribution of satisfiability problems.