🤖 AI Summary
This study investigates whether the performance of large language models on formal language tasks stems from genuine symbolic reasoning or mere pattern memorization. To this end, we introduce a novel benchmark based on deterministic finite automata (DFAs), comprising both seen tasks and two types of unseen regular language construction problems. By integrating handwritten multi-constraint instances with systematically generated examples derived from Arden’s Theorem, our benchmark rigorously disentangles memorization from reasoning. Employing a three-stage prompting refinement protocol—augmented with strategies such as Chain-of-Thought and Tree-of-Thought—and complemented by structured error analysis, we find that models achieve 84–90% accuracy on seen tasks but suffer a sharp performance drop of 30–64% on unseen tasks. These results expose fundamental limitations in the models’ capacity for semantic consistency and global structural reasoning.
📝 Abstract
Large language models (LLMs) have demonstrated strong performance on formal language tasks, yet whether this reflects genuine symbolic reasoning or pattern matching on familiar constructions remains unclear. We introduce a benchmark for deterministic finite automata (DFA) construction from regular languages, comprising factual knowledge questions, seen construction problems from public sources, and two types of unseen problems: hand-crafted instances with multiple interacting constraints and systematically generated problems via Arden's theorem. Models achieve perfect accuracy on factual questions and 84-90% on seen tasks. However, accuracy drops sharply on unseen problems (by 30-64%), with failures stemming from systematic misinterpretation of language constraints, incorrect handling of Kleene-star semantics, and a failure to preserve global consistency. We evaluate a three-stage hint protocol that enables correction of shallow errors but does not reliably resolve globally inconsistent or structurally flawed automata. Our analysis across multiple prompting strategies (direct, Chain-of-Thought, Tree-of-Thought) reveals that errors persist regardless of prompting approach, exposing a fundamental gap between LLMs' ability to generate syntactically plausible DFAs and their capacity for semantically correct formal reasoning.
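For readers unfamiliar with the generation mechanism mentioned above: Arden's theorem gives a closed-form solution to language equations, which makes it a natural engine for systematically producing fresh regular-language problems. Its standard textbook statement (not spelled out in the abstract itself) is:

```latex
% Arden's lemma (standard form): for regular languages A and B over
% an alphabet \Sigma, with \varepsilon \notin A, the equation
%   X = AX \cup B
% has the unique solution
%   X = A^{*}B.
%
% Illustrative instance: X = \{a\}X \cup \{b\} is solved uniquely by
%   X = a^{*}b,  i.e., the language \{ a^{n}b : n \ge 0 \}.
\[
  X = AX \cup B \;\wedge\; \varepsilon \notin A
  \;\Longrightarrow\;
  X = A^{*}B
\]
```

Solving the system of such equations induced by a DFA's states yields a regular expression for its language, which is presumably how the benchmark derives construction problems with known ground-truth automata.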