🤖 AI Summary
This study addresses the limitations of large language models (LLMs) in higher-order cognitive tasks—specifically pattern recognition, spatial reasoning, arithmetic, and logical inference—by introducing TextGames, a benchmark for deep-reasoning evaluation grounded in text-based puzzle games. TextGames supports both single-turn and multi-turn reasoning assessment and incorporates an interactive, feedback-driven self-reflection mechanism. Experimental results show that reasoning-optimized models substantially outperform instruction-following baselines, and that iterative self-reflection improves accuracy; yet sequencing, counting, and consistent rule-following remain critical bottlenecks. While LLMs perform well on easy and medium-difficulty tasks, they fall significantly short of humans on complex ones. The work contributes a task-driven, multidimensional reasoning evaluation framework; a human-annotated dataset with fine-grained difficulty grading; and a systematic characterization of LLMs’ deep-reasoning capabilities in text-based interactive environments.
📝 Abstract
Reasoning is a fundamental capability of large language models (LLMs), enabling them to comprehend, analyze, and solve complex problems. In this paper, we introduce TextGames, an innovative benchmark specifically crafted to assess LLMs through demanding text-based games that require advanced skills in pattern recognition, spatial awareness, arithmetic, and logical reasoning. Our analysis probes LLMs' performance in both single-turn and multi-turn reasoning, as well as their ability to leverage feedback to correct subsequent answers through self-reflection. Our findings reveal that, although LLMs exhibit proficiency in addressing most easy and medium-level problems, they face significant challenges with more difficult tasks; in contrast, humans are capable of solving all tasks when given sufficient time. Moreover, we observe that LLMs improve in multi-turn predictions through self-reflection, yet they still struggle with sequencing, counting, and following complex rules consistently. Additionally, models optimized for reasoning outperform pre-trained LLMs that prioritize instruction following, highlighting the crucial role of reasoning skills in addressing highly complex problems.