🤖 AI Summary
This work proposes LogicSkills, a structured benchmark that disentangles formal logical reasoning into three independently assessable core skills: symbolization, countermodel construction, and validity judgment. To rigorously evaluate these capabilities, the benchmark employs bilingual test items—pairing natural language with a Carroll-style fictional language—generated from the two-variable fragment of first-order logic without equality; all samples are verified for correctness and non-triviality using the SMT solver Z3. Experimental results reveal that while state-of-the-art large language models perform adequately on validity judgment, they exhibit significant deficiencies in symbolization and countermodel construction, suggesting a reliance on superficial patterns rather than genuine symbolic reasoning abilities.
📝 Abstract
Large language models have demonstrated notable performance across various logical reasoning benchmarks. However, it remains unclear which core logical skills they truly master. To address this, we introduce LogicSkills, a unified benchmark designed to isolate three fundamental skills in formal reasoning: (i) $\textit{formal symbolization}$: translating premises into first-order logic; (ii) $\textit{countermodel construction}$: formulating a finite structure in which all premises are true while the conclusion is false; and (iii) $\textit{validity assessment}$: deciding whether a conclusion follows from a given set of premises. Items are drawn from the two-variable fragment of first-order logic (without identity) and are presented in both natural English and a Carroll-style language with nonce words. All examples are verified for correctness and non-triviality using the SMT solver Z3. Across leading models, performance is high on validity but substantially lower on symbolization and countermodel construction, suggesting reliance on surface-level patterns rather than genuine symbolic or rule-based reasoning.
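To make the countermodel-construction task concrete, here is a minimal sketch (not the paper's Z3-based harness) of how one might search for a finite countermodel in a purely monadic setting: enumerate small interpretations and look for one that satisfies every premise while falsifying the conclusion. The `countermodel` helper and the example formulas are illustrative names, not from the benchmark; a bounded search like this is refutation-complete here only because the two-variable fragment has the finite model property.

```python
from itertools import product

def countermodel(premises, conclusion, preds, max_size=3):
    """Search finite models up to max_size elements for one that makes
    every premise true and the conclusion false. Each formula is a
    callable taking (interp, domain), where interp maps each predicate
    name to the set of domain elements satisfying it."""
    for n in range(1, max_size + 1):
        domain = range(n)
        subsets = list(product([False, True], repeat=n))
        # Try every assignment of a subset of the domain to each predicate.
        for assignment in product(subsets, repeat=len(preds)):
            interp = {p: {i for i in domain if mask[i]}
                      for p, mask in zip(preds, assignment)}
            if (all(f(interp, domain) for f in premises)
                    and not conclusion(interp, domain)):
                return interp  # countermodel found: argument is invalid
    return None  # no countermodel up to max_size

# Valid syllogism: "All A are B", "All B are C" entail "All A are C".
all_AB = lambda m, d: all(x in m["B"] for x in m["A"])
all_BC = lambda m, d: all(x in m["C"] for x in m["B"])
all_AC = lambda m, d: all(x in m["C"] for x in m["A"])
print(countermodel([all_AB, all_BC], all_AC, ["A", "B", "C"]))  # None

# Invalid: "All A are B" does not entail "All B are A".
all_BA = lambda m, d: all(x in m["A"] for x in m["B"])
print(countermodel([all_AB], all_BA, ["A", "B", "C"]))
```

The second call returns a model such as $A = \varnothing$, $B = \{0\}$, in which the premise holds vacuously but the conclusion fails; the paper's pipeline performs the analogous check with Z3 rather than brute-force enumeration.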