Beyond Memorization: Testing LLM Reasoning on Unseen Theory of Computation Tasks

📅 2026-01-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether the performance of large language models on formal language tasks stems from genuine symbolic reasoning or mere pattern memorization. To this end, we introduce a novel benchmark based on deterministic finite automata (DFAs), comprising both seen tasks and two types of unseen regular language construction problems. By integrating handwritten multi-constraint instances with systematically generated examples derived from Arden’s Theorem, our benchmark rigorously disentangles memorization from reasoning. Employing a three-stage prompting refinement protocol—augmented with strategies such as Chain-of-Thought and Tree-of-Thought—and complemented by structured error analysis, we find that models achieve 84–90% accuracy on seen tasks but suffer a sharp performance drop of 30–64% on unseen tasks. These results expose fundamental limitations in the models’ capacity for semantic consistency and global structural reasoning.
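For context, Arden's theorem, which the benchmark uses to systematically generate unseen construction problems, can be stated as follows (standard textbook formulation, not quoted from the paper):

```latex
\textbf{Arden's theorem.} Let $A$ and $B$ be regular languages over an
alphabet $\Sigma$ with $\varepsilon \notin A$. Then the language equation
\[
    X = A X \cup B
\]
has the unique solution
\[
    X = A^{*} B .
\]
```

Solving such equations over the states of an automaton yields a regular expression for each state's language, which is presumably how the benchmark generates problem families mechanically.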

📝 Abstract
Large language models (LLMs) have demonstrated strong performance on formal language tasks, yet whether this reflects genuine symbolic reasoning or pattern matching on familiar constructions remains unclear. We introduce a benchmark for deterministic finite automata (DFA) construction from regular languages, comprising factual knowledge questions, seen construction problems from public sources, and two types of unseen problems: hand-crafted instances with multiple interacting constraints and systematically generated problems via Arden's theorem. Models achieve perfect accuracy on factual questions and 84-90% on seen tasks. However, accuracy drops sharply on unseen problems (by 30-64%), with failures stemming from systematic misinterpretation of language constraints, incorrect handling of Kleene-star semantics, and a failure to preserve global consistency. We evaluate a three-stage hint protocol that enables correction of shallow errors but does not reliably resolve globally inconsistent or structurally flawed automata. Our analysis across multiple prompting strategies (direct, Chain-of-Thought, Tree-of-Thought) reveals that errors persist regardless of prompting approach, exposing a fundamental gap between LLMs' ability to generate syntactically plausible DFAs and their capacity for semantically correct formal reasoning.
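To make the construction task concrete, here is a minimal sketch (illustrative only, not taken from the benchmark) of the kind of DFA a model is asked to produce for a regular language, encoded in Python together with an acceptance check:

```python
# Illustrative DFA for the regular language "strings over {a, b}
# containing an even number of a's" -- a simple instance of the
# construction task the benchmark poses to models.

DFA = {
    "alphabet": {"a", "b"},
    "transitions": {
        ("even", "a"): "odd",
        ("even", "b"): "even",
        ("odd", "a"): "even",
        ("odd", "b"): "odd",
    },
    "start": "even",
    "accept": {"even"},
}

def accepts(dfa, word):
    """Run the DFA on `word` and report whether it ends in an accept state."""
    state = dfa["start"]
    for symbol in word:
        if symbol not in dfa["alphabet"]:
            raise ValueError(f"symbol {symbol!r} not in alphabet")
        state = dfa["transitions"][(state, symbol)]
    return state in dfa["accept"]

print(accepts(DFA, "abab"))  # True: two a's
print(accepts(DFA, "ab"))    # False: one a
```

A model's output can be syntactically well-formed (valid states, total transition function) yet semantically wrong for the target language, which is exactly the gap the paper reports.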
Problem

Research questions and friction points this paper is trying to address.

symbolic reasoning
formal language
deterministic finite automata
unseen tasks
language constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

symbolic reasoning
deterministic finite automata
unseen problem generalization
formal language understanding
prompting strategies