🤖 AI Summary
This work proposes LogicSkills, a structured benchmark that disentangles formal logical reasoning into three independently assessable core skills: symbolization, countermodel construction, and validity judgment. To rigorously evaluate these capabilities, the benchmark employs bilingual test items—pairing natural language with a Carroll-style fictional language—generated from the two-variable fragment of first-order logic without equality; all samples are verified for correctness and non-triviality using the SMT solver Z3. Experimental results reveal that while state-of-the-art large language models perform adequately on validity judgment, they exhibit significant deficiencies in symbolization and countermodel construction, suggesting a reliance on superficial patterns rather than genuine symbolic reasoning abilities.
📝 Abstract
Large language models have demonstrated notable performance across various logical reasoning benchmarks. However, it remains unclear which core logical skills they truly master. To address this, we introduce LogicSkills, a unified benchmark designed to isolate three fundamental skills in formal reasoning: (i) $\textit{formal symbolization}$: translating premises into first-order logic; (ii) $\textit{countermodel construction}$: formulating a finite structure in which all premises are true while the conclusion is false; and (iii) $\textit{validity assessment}$: deciding whether a conclusion follows from a given set of premises. Items are drawn from the two-variable fragment of first-order logic (without identity) and are presented in both natural English and a Carroll-style language with nonce words. All examples are verified for correctness and non-triviality using the SMT solver Z3. Across leading models, performance is high on validity but substantially lower on symbolization and countermodel construction, suggesting reliance on surface-level patterns rather than genuine symbolic or rule-based reasoning.
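To make the countermodel-construction task concrete, here is a minimal sketch (not the paper's Z3-based harness) of how one might search for a finite countermodel in a purely monadic setting: enumerate small interpretations and look for one that satisfies every premise while falsifying the conclusion. The `countermodel` helper and the example formulas are illustrative names, not from the benchmark; a bounded search like this is refutation-complete here only because the two-variable fragment has the finite model property.

```python
from itertools import product

def countermodel(premises, conclusion, preds, max_size=3):
    """Search finite models up to max_size elements for one that makes
    every premise true and the conclusion false. Each formula is a
    callable taking (interp, domain), where interp maps each predicate
    name to the set of domain elements satisfying it."""
    for n in range(1, max_size + 1):
        domain = range(n)
        subsets = list(product([False, True], repeat=n))
        # Try every assignment of a subset of the domain to each predicate.
        for assignment in product(subsets, repeat=len(preds)):
            interp = {p: {i for i in domain if mask[i]}
                      for p, mask in zip(preds, assignment)}
            if (all(f(interp, domain) for f in premises)
                    and not conclusion(interp, domain)):
                return interp  # countermodel found: argument is invalid
    return None  # no countermodel up to max_size

# Valid syllogism: "All A are B", "All B are C" entail "All A are C".
all_AB = lambda m, d: all(x in m["B"] for x in m["A"])
all_BC = lambda m, d: all(x in m["C"] for x in m["B"])
all_AC = lambda m, d: all(x in m["C"] for x in m["A"])
print(countermodel([all_AB, all_BC], all_AC, ["A", "B", "C"]))  # None

# Invalid: "All A are B" does not entail "All B are A".
all_BA = lambda m, d: all(x in m["A"] for x in m["B"])
print(countermodel([all_AB], all_BA, ["A", "B", "C"]))
```

The second call returns a model such as $A = \varnothing$, $B = \{0\}$, in which the premise holds vacuously but the conclusion fails; the paper's pipeline performs the analogous check with Z3 rather than brute-force enumeration.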