Natural Language Satisfiability: Exploring the Problem Distribution and Evaluating Transformer-based Language Models

📅 2025-08-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates how computational complexity (spanning the P, NP, and PSPACE classes) and syntactic structure affect Transformer models' logical reasoning on Natural Language Satisfiability (NL-SAT). We introduce the first NL-SAT benchmark covering multiple complexity classes, constructed via formal semantic modeling and controllable instance generation, to yield a multidimensional evaluation suite in which syntax and complexity vary independently. We further propose a novel evaluation protocol based on reasoning-path consistency, moving beyond conventional accuracy metrics. Empirical results show that performance degrades markedly as problem complexity increases and that models are highly susceptible to superficial syntactic variations, revealing fundamental limitations in symbolic reasoning. Our study provides a theory-grounded benchmark and methodological framework for evaluating natural language inference, exposing intrinsic bottlenecks of current large language models in formal reasoning tasks.

📝 Abstract

Efforts to apply transformer-based language models (TLMs) to reasoning in natural language have enjoyed ever-increasing success in recent years. The most fundamental task in this area, to which nearly all others can be reduced, is determining satisfiability. However, from a logical point of view, satisfiability problems differ along several dimensions, which may affect TLMs' ability to learn how to solve them. In particular, instances of satisfiability in natural language can belong to different computational complexity classes depending on the language fragment in which they are expressed. Although prior research has explored natural language satisfiability, this point has not been adequately discussed. Hence, we investigate how problem instances from different computational complexity classes, and with different grammatical constructs, affect TLMs' ability to learn rules of inference. Furthermore, to evaluate TLMs faithfully, we conduct an empirical study of the distribution of satisfiability problems.
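
The abstract's claim that other reasoning tasks reduce to satisfiability can be illustrated with entailment: a set of premises entails a conclusion exactly when the premises together with the negated conclusion are unsatisfiable. The sketch below is only an illustration under simplifying assumptions. It uses propositional clauses rather than the natural language fragments studied in the paper, and the names `satisfiable`, `entails`, and the example encoding are hypothetical, not taken from the authors' code.

```python
from itertools import product

# Assumed encoding (not from the paper): a formula is a list of clauses,
# a clause is a list of literals, and a literal is (variable_name, polarity).

def satisfiable(clauses):
    """Brute-force check: try every truth assignment over the variables used."""
    variables = sorted({name for clause in clauses for name, _ in clause})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        # The formula holds if every clause has at least one true literal.
        if all(any(assignment[name] == polarity for name, polarity in clause)
               for clause in clauses):
            return True
    return False

def entails(premises, conclusion_literal):
    """Premises entail a literal iff premises plus its negation are unsatisfiable."""
    name, polarity = conclusion_literal
    return not satisfiable(premises + [[(name, not polarity)]])

# Hypothetical encoding of "If it rains, the street is wet. It rains."
premises = [[("rain", False), ("wet", True)], [("rain", True)]]
print(entails(premises, ("wet", True)))
```

The brute-force loop is exponential in the number of variables, which is enough for a toy example; it says nothing about the complexity of the specific language fragments the paper considers.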

Problem

Research questions and friction points this paper is trying to address.

Investigating how computational complexity classes affect transformer models' inference learning

Evaluating the impact of different grammatical constructs on language model performance

Exploring the distribution of natural language satisfiability problems for faithful evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating transformer models on satisfiability problems

Analyzing the impact of complexity classes on model learning

Empirical study of the distribution of natural language satisfiability problems
