LogicSkills: A Structured Benchmark for Formal Reasoning in Large Language Models

📅 2026-02-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work proposes LogicSkills, a structured benchmark that disentangles formal logical reasoning into three independently assessable core skills: symbolization, countermodel construction, and validity judgment. To rigorously evaluate these capabilities, the benchmark employs bilingual test items—pairing natural language with a Carroll-style fictional language—generated from the two-variable fragment of first-order logic without equality; all samples are verified for correctness and non-triviality using the SMT solver Z3. Experimental results reveal that while state-of-the-art large language models perform adequately on validity judgment, they exhibit significant deficiencies in symbolization and countermodel construction, suggesting a reliance on superficial patterns rather than genuine symbolic reasoning abilities.

📝 Abstract
Large language models have demonstrated notable performance across various logical reasoning benchmarks. However, it remains unclear which core logical skills they truly master. To address this, we introduce LogicSkills, a unified benchmark designed to isolate three fundamental skills in formal reasoning: (i) formal symbolization: translating premises into first-order logic; (ii) countermodel construction: formulating a finite structure in which all premises are true while the conclusion is false; and (iii) validity assessment: deciding whether a conclusion follows from a given set of premises. Items are drawn from the two-variable fragment of first-order logic (without identity) and are presented in both natural English and a Carroll-style language with nonce words. All examples are verified for correctness and non-triviality using the SMT solver Z3. Across leading models, performance is high on validity but substantially lower on symbolization and countermodel construction, suggesting reliance on surface-level patterns rather than genuine symbolic or rule-based reasoning.
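The countermodel-construction skill defined in the abstract can be illustrated with a minimal, stdlib-only sketch. This is not the paper's Z3 verification pipeline; the inference ("All A are B, therefore all B are A") and every name below are illustrative. The idea is simply to enumerate finite structures over a small domain until one makes the premise true and the conclusion false:

```python
from itertools import product

def all_are(domain, A, B):
    """'All A are B' over the given domain: every element of A is in B."""
    return all((x not in A) or (x in B) for x in domain)

def find_countermodel(domain_size=2):
    """Brute-force search for a finite countermodel to the invalid
    inference 'All A are B, therefore all B are A' by enumerating
    every pair of unary predicates (subsets) over a small domain.
    Returns (A, B) with the premise true and the conclusion false."""
    domain = range(domain_size)
    # All subsets of the domain, encoded via bit vectors.
    subsets = [frozenset(x for x, keep in zip(domain, bits) if keep)
               for bits in product([0, 1], repeat=domain_size)]
    for A in subsets:
        for B in subsets:
            premise = all_are(domain, A, B)
            conclusion = all_are(domain, B, A)
            if premise and not conclusion:
                return A, B  # countermodel found
    return None  # no countermodel at this domain size

cm = find_countermodel()
print(cm)
```

For this inference a two-element domain already suffices (e.g. an empty A with a nonempty B makes the premise vacuously true and the conclusion false), which mirrors the benchmark's requirement that countermodels be finite structures.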
Problem

Research questions and friction points this paper is trying to address.

formal reasoning
logical skills
large language models
symbolization
validity assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

formal reasoning
structured benchmark
symbolic reasoning
countermodel construction
logic evaluation