Turing Machine Evaluation for Large Language Model

📅 2025-04-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluations of large language models (LLMs) lack rigorous assessment of their core computational reasoning capabilities. Method: This paper introduces a novel evaluation paradigm grounded in the state evolution of a universal Turing machine (UTM), dynamically tracking tape contents and head positions across multiple steps to assess models’ understanding of rules, strict adherence to instructions, and maintenance of state consistency. Contribution/Results: Based on this paradigm, we construct TMBench—the first knowledge-agnostic, difficulty-controllable, and infinitely scalable computational reasoning benchmark. Experiments show that TMBench exhibits strong correlation with mainstream reasoning benchmarks (Pearson = 0.73), effectively characterizing LLMs’ deep execution capabilities and enabling systematic evaluation of capability progression. All code and data are publicly released.

📝 Abstract
With the rapid development and widespread application of Large Language Models (LLMs), rigorous evaluation has become particularly crucial. This research adopts a novel perspective, focusing on evaluating the core computational reasoning ability of LLMs, defined as the capacity of a model to accurately understand rules and execute logical computing operations. This capability assesses the reliability of LLMs as precise executors, and is critical to advanced tasks such as complex code generation and multi-step problem solving. We propose an evaluation framework based on Universal Turing Machine (UTM) simulation. This framework requires LLMs to strictly follow instructions and track dynamic state, such as tape content and read/write head position, during multi-step computations. To enable standardized evaluation, we developed TMBench, a benchmark for systematically studying the computational reasoning capabilities of LLMs. TMBench provides several key advantages, including knowledge-agnostic evaluation, adjustable difficulty, foundational coverage through Turing machine encoding, and unlimited capacity for instance generation, ensuring scalability as models continue to evolve. We find that model performance on TMBench correlates strongly with performance on other recognized reasoning benchmarks (Pearson correlation coefficient of 0.73), clearly demonstrating that computational reasoning is a significant dimension for measuring the deep capabilities of LLMs. Code and data are available at https://github.com/HaitaoWuTJU/Turing-Machine-Bench.
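To make the evaluation setup concrete: the paper asks a model to simulate a Turing machine step by step, reporting the tape content and head position after each transition. The sketch below is a minimal single-tape simulator that produces such a ground-truth trace; the transition-table format and symbols here are illustrative assumptions, not TMBench's actual encoding.

```python
def run_tm(transitions, tape, state="q0", head=0, max_steps=20):
    """Simulate a single-tape Turing machine and record the full trace.

    transitions: {(state, read_symbol): (new_state, write_symbol, move)}
    where move is -1 (left) or +1 (right). Halts when no rule applies.
    Returns a list of (state, tape_string, head_position) after each step.
    NOTE: this format is a sketch, not TMBench's actual instance encoding.
    """
    tape = list(tape)
    trace = [(state, "".join(tape), head)]
    for _ in range(max_steps):
        key = (state, tape[head])
        if key not in transitions:          # halt: no applicable rule
            break
        state, write, move = transitions[key]
        tape[head] = write
        head += move
        if head < 0:                        # grow the tape on demand
            tape.insert(0, "_")
            head = 0
        elif head >= len(tape):
            tape.append("_")
        trace.append((state, "".join(tape), head))
    return trace

# Example machine: flip each bit while moving right, halt on blank "_".
rules = {
    ("q0", "0"): ("q0", "1", +1),
    ("q0", "1"): ("q0", "0", +1),
}
trace = run_tm(rules, "101_")
# The trace lists every intermediate (state, tape, head) configuration,
# which is exactly the state-consistency information the benchmark
# checks a model against at each step.
```

An evaluation harness could then compare a model's step-by-step answers against `trace` entry by entry, scoring how long the model stays consistent with the true machine state.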
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' computational reasoning via Turing Machine simulation
Assessing model reliability in rule understanding and logical execution
Developing TMBench for scalable, knowledge-agnostic reasoning evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

UTM simulation framework for LLM evaluation
TMBench benchmark for computational reasoning
Knowledge-agnostic scalable Turing machine encoding
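The knowledge-agnostic, unlimited-instance property follows from the fact that transition tables can be sampled at random, with difficulty scaled by the number of states and simulation steps. The sketch below shows one plausible way to generate such instances; the parameters and encoding are assumptions for illustration, not the paper's actual generator.

```python
import random

def make_instance(n_states=3, alphabet="01_", tape_len=8, seed=0):
    """Sample a random, total Turing machine transition table plus an
    initial tape. Difficulty can be raised via n_states, alphabet size,
    or tape length. (Illustrative sketch; not TMBench's generator.)
    """
    rng = random.Random(seed)               # seeded for reproducibility
    states = [f"q{i}" for i in range(n_states)]
    table = {}
    for s in states:
        for sym in alphabet:
            table[(s, sym)] = (
                rng.choice(states),         # next state
                rng.choice(alphabet),       # symbol to write
                rng.choice([-1, +1]),       # head move
            )
    tape = "".join(rng.choice("01") for _ in range(tape_len))
    return table, tape

table, tape = make_instance(n_states=3, seed=42)
```

Because instances are drawn from the machine description itself rather than from world knowledge, performance cannot be inflated by memorized facts, and fresh instances can be sampled indefinitely as models improve.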