🤖 AI Summary
Problem: It remains unclear whether large language models (LLMs) possess genuine formal logical reasoning capabilities, particularly in syllogistic inference, or merely emulate human intuitive reasoning through statistical pattern matching.
Method: We introduce the first unified evaluation framework that jointly assesses symbolic logical validity and natural language comprehension, benchmarking 14 state-of-the-art LLMs on a standardized syllogism test suite (a sketch of what such a symbolic validity check can look like follows this summary).
Contribution/Results: Syllogistic reasoning is not a universally emergent capability across LLMs; rather, performance varies significantly. Notably, several models achieve 100% accuracy on symbolic syllogistic tasks, demonstrating behavior closely aligned with formal logic engines. This challenges the prevailing assumption that LLMs rely solely on surface-level statistical correlations to mimic reasoning. Our work provides critical empirical evidence and a methodological foundation for characterizing the nature of LLM reasoning and advancing trustworthy AI systems.
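The paper's actual evaluation harness is not reproduced here, but to make the symbolic side of the task concrete, below is a minimal sketch of a categorical-syllogism validity checker via brute-force model enumeration. Everything in it is illustrative and assumed, not the authors' framework: the A/E/I/O sentence encoding, the `is_valid` helper, and the three-element domain are choices made for this sketch.

```python
# Minimal sketch (illustrative, not the paper's framework): a brute-force
# validity checker for categorical syllogisms via model enumeration.
from itertools import combinations

# A 3-element domain suffices: among the two premises and the negated
# conclusion there are at most three existential sentences, each needing
# one witness, and universal sentences survive restriction to witnesses.
DOMAIN = (0, 1, 2)

def powerset(xs):
    """All possible extensions a term can take over the domain."""
    return [frozenset(c) for r in range(len(xs) + 1)
            for c in combinations(xs, r)]

# The four categorical sentence forms, as tests on two term extensions.
FORMS = {
    "A": lambda s, p: s <= p,        # All S are P
    "E": lambda s, p: not (s & p),   # No S are P
    "I": lambda s, p: bool(s & p),   # Some S are P
    "O": lambda s, p: bool(s - p),   # Some S are not P
}

def holds(sentence, ext):
    form, subj, pred = sentence
    return FORMS[form](ext[subj], ext[pred])

def is_valid(premises, conclusion):
    """True iff every model of the premises satisfies the conclusion."""
    for s in powerset(DOMAIN):
        for m in powerset(DOMAIN):
            for p in powerset(DOMAIN):
                ext = {"S": s, "M": m, "P": p}
                if (all(holds(q, ext) for q in premises)
                        and not holds(conclusion, ext)):
                    return False  # countermodel found
    return True

# Barbara (AAA-1) is valid: All M are P; All S are M; so All S are P.
print(is_valid([("A", "M", "P"), ("A", "S", "M")], ("A", "S", "P")))  # True
# AAI-1 fails without existential import (S may be empty).
print(is_valid([("A", "M", "P"), ("A", "S", "M")], ("I", "S", "P")))  # False
```

Scoring an LLM against such a ground-truth checker is then a matter of posing each (premises, conclusion) pair in both symbolic and natural language phrasings and comparing the model's verdict to `is_valid`, which mirrors the paper's dual symbolic/natural-language evaluation.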
📝 Abstract
We study syllogistic reasoning in LLMs from both logical and natural language perspectives. In the process, we explore the fundamental reasoning capabilities of LLMs and the direction in which this research is moving. To this end, we use 14 large language models and investigate their syllogistic reasoning capabilities in terms of both symbolic inference and natural language understanding. Even though this reasoning mechanism is not a uniform emergent property across LLMs, the perfect symbolic performance of certain models makes us wonder whether LLMs are becoming formal reasoning mechanisms, rather than making explicit the nuances of human reasoning.