🤖 AI Summary
This study investigates error mechanisms in large language models (LLMs) during code execution reasoning, focusing on intermediate-state misjudgments and control-flow deviations. We construct the first empirical benchmark targeting reasoning trajectories: 427 diverse code snippets drawn from HumanEval Plus and LiveCodeBench, each exercised with regular, edge-case, and invalid inputs. We systematically analyze four state-of-the-art reasoning LLMs on this benchmark and introduce a fine-grained taxonomy of nine reasoning error categories, validated through human annotation and automated trajectory analysis to pinpoint root causes. Output accuracy ranges from 85% to 98% across input types, and tool augmentation corrects 58% of computation-related errors, substantially improving reasoning reliability. Our work provides an interpretable error-diagnostic framework and verifiable correction pathways for trustworthy code generation.
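The benchmark described above pairs each snippet input with its ground-truth execution result and scores a model's predicted outcomes against them. A minimal sketch of that pairing and scoring step, with all function names and the toy snippet being our illustrative assumptions rather than the paper's actual harness:

```python
def ground_truth(func, inp):
    """Execute a snippet on one input; record the value or the exception type.
    Invalid inputs are expected to raise, so exceptions are outcomes too."""
    try:
        return ("value", func(*inp))
    except Exception as exc:
        return ("exception", type(exc).__name__)

def evaluate(func, inputs, predictions):
    """Fraction of inputs where the model-predicted outcome matches execution."""
    correct = sum(
        ground_truth(func, inp) == pred
        for inp, pred in zip(inputs, predictions)
    )
    return correct / len(inputs)

# Toy snippet standing in for one benchmark item.
def safe_div(a, b):
    return a / b

inputs = [(6, 3), (1, 0), (0, 5)]              # regular, invalid, edge
preds = [("value", 2.0),                       # hypothetical model predictions
         ("exception", "ZeroDivisionError"),
         ("value", 0.0)]
print(evaluate(safe_div, inputs, preds))       # prints 1.0 (all three correct)
```

Recording the exception type (rather than treating errors as failures) is what lets invalid inputs be graded the same way as regular and edge inputs.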
📝 Abstract
Understanding a program's runtime behavior, that is, how intermediate states and control flow lead to the final execution result, is essential for reliable code generation, debugging, and automated reasoning. Although large language models (LLMs) can often predict program outputs accurately, most prior work has focused on output accuracy and performance, treating the reasoning process as a black box. As a result, little is known about the structure or failure modes of their reasoning traces. To address this gap, we conduct the first empirical study of runtime behavior inference with reasoning LLMs, aiming to uncover and characterize errors in their reasoning traces. We curate a benchmark of 427 code snippets from HumanEval Plus and LiveCodeBench. For each snippet, we test three input types: regular, edge, and invalid, selecting twelve input values per snippet, each paired with its ground-truth execution result. Evaluating four state-of-the-art reasoning LLMs, we find that accuracy ranges between 85 percent and 98 percent across input types. We further analyze the produced reasoning traces and develop a taxonomy of nine categories of inference errors. Finally, we explore tool-augmented reasoning: using failures in the Computation Errors category as a case study, our experiments show that this approach corrects 58 percent of such errors, demonstrating the potential of tool support for improving LLM reasoning.
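The tool-augmented reasoning explored above offloads arithmetic from the model's trace to an interpreter instead of trusting in-text computation. One way this might look in practice, with the function names, the extraction regex, and the example trace all being our assumptions rather than the paper's mechanism:

```python
import ast
import re

def safe_eval(expr):
    """Evaluate a purely arithmetic expression; reject anything else."""
    node = ast.parse(expr, mode="eval")
    for sub in ast.walk(node):
        if not isinstance(sub, (ast.Expression, ast.BinOp, ast.UnaryOp,
                                ast.Constant, ast.operator, ast.unaryop)):
            raise ValueError(f"unsupported construct: {type(sub).__name__}")
    return eval(compile(node, "<calc>", "eval"))

def ground_arithmetic(trace):
    """Replace each 'expr = claimed' step in a reasoning trace with the
    interpreter's result, correcting any miscomputed values."""
    pattern = re.compile(r"(\d+(?:\s*[-+*/]\s*\d+)+)\s*=\s*-?\d+")
    def fix(match):
        expr = match.group(1)
        return f"{expr} = {safe_eval(expr)}"
    return pattern.sub(fix, trace)

# A hypothetical trace fragment with a computation error (17 * 24 is 408):
print(ground_arithmetic("the total count is 17 * 24 = 418"))
# prints: the total count is 17 * 24 = 408
```

Restricting evaluation to arithmetic AST nodes keeps the tool from executing arbitrary code found in a model's trace, which matters if traces are grounded automatically at scale.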