🤖 AI Summary
This study investigates error mechanisms in large language models (LLMs) during code execution reasoning, focusing on intermediate-state misjudgments and control-flow deviations. We construct the first empirical benchmark targeting reasoning trajectories: 427 diverse code snippets drawn from HumanEval Plus and LiveCodeBench, each exercised with regular, edge-case, and invalid inputs. We systematically analyze four state-of-the-art reasoning LLMs on this benchmark and introduce a fine-grained taxonomy of nine reasoning error categories, validated through human annotation and automated trajectory analysis to pinpoint root causes. Output accuracy ranges from 85% to 98% across input types, and tool augmentation corrects 58% of computation-related errors, substantially improving reasoning reliability. Our work provides an interpretable error-diagnostic framework and verifiable correction pathways for trustworthy code generation.
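The benchmark described above pairs each snippet input with its ground-truth execution result and scores a model's predicted outcomes against them. A minimal sketch of that pairing and scoring step, with all function names and the toy snippet being our illustrative assumptions rather than the paper's actual harness:

```python
def ground_truth(func, inp):
    """Execute a snippet on one input; record the value or the exception type.
    Invalid inputs are expected to raise, so exceptions are outcomes too."""
    try:
        return ("value", func(*inp))
    except Exception as exc:
        return ("exception", type(exc).__name__)

def evaluate(func, inputs, predictions):
    """Fraction of inputs where the model-predicted outcome matches execution."""
    correct = sum(
        ground_truth(func, inp) == pred
        for inp, pred in zip(inputs, predictions)
    )
    return correct / len(inputs)

# Toy snippet standing in for one benchmark item.
def safe_div(a, b):
    return a / b

inputs = [(6, 3), (1, 0), (0, 5)]              # regular, invalid, edge
preds = [("value", 2.0),                       # hypothetical model predictions
         ("exception", "ZeroDivisionError"),
         ("value", 0.0)]
print(evaluate(safe_div, inputs, preds))       # prints 1.0 (all three correct)
```

Recording the exception type (rather than treating errors as failures) is what lets invalid inputs be graded the same way as regular and edge inputs.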
📝 Abstract
Understanding a program's runtime behavior, that is, how intermediate states and control flow lead to the final execution result, is essential for reliable code generation, debugging, and automated reasoning. Although large language models (LLMs) can often predict program outputs accurately, most prior work has focused on output accuracy and performance, treating the reasoning process as a black box. As a result, little is known about the structure or failure modes of their reasoning traces. To address this gap, we conduct the first empirical study of runtime behavior inference with reasoning LLMs, aiming to uncover and characterize errors in their reasoning traces. We curate a benchmark of 427 code snippets from HumanEval Plus and LiveCodeBench. For each snippet, we test three input types: regular, edge, and invalid, selecting twelve input values per snippet, each paired with its ground-truth execution result. Evaluating four state-of-the-art reasoning LLMs, we find that accuracy ranges between 85 percent and 98 percent across input types. We further analyze the produced reasoning traces and develop a taxonomy of nine categories of inference errors. Finally, we explore tool-augmented reasoning: using failures in the Computation Errors category as a case study, our experiments show that this approach corrects 58 percent of such errors, demonstrating the potential of tool support for improving LLM reasoning.
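The tool-augmented reasoning explored above offloads arithmetic from the model's trace to an interpreter instead of trusting in-text computation. One way this might look in practice, with the function names, the extraction regex, and the example trace all being our assumptions rather than the paper's mechanism:

```python
import ast
import re

def safe_eval(expr):
    """Evaluate a purely arithmetic expression; reject anything else."""
    node = ast.parse(expr, mode="eval")
    for sub in ast.walk(node):
        if not isinstance(sub, (ast.Expression, ast.BinOp, ast.UnaryOp,
                                ast.Constant, ast.operator, ast.unaryop)):
            raise ValueError(f"unsupported construct: {type(sub).__name__}")
    return eval(compile(node, "<calc>", "eval"))

def ground_arithmetic(trace):
    """Replace each 'expr = claimed' step in a reasoning trace with the
    interpreter's result, correcting any miscomputed values."""
    pattern = re.compile(r"(\d+(?:\s*[-+*/]\s*\d+)+)\s*=\s*-?\d+")
    def fix(match):
        expr = match.group(1)
        return f"{expr} = {safe_eval(expr)}"
    return pattern.sub(fix, trace)

# A hypothetical trace fragment with a computation error (17 * 24 is 408):
print(ground_arithmetic("the total count is 17 * 24 = 418"))
# prints: the total count is 17 * 24 = 408
```

Restricting evaluation to arithmetic AST nodes keeps the tool from executing arbitrary code found in a model's trace, which matters if traces are grounded automatically at scale.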