🤖 AI Summary
Large Reasoning Models (LRMs) exhibit diverse multi-step reasoning behaviors in code generation, yet the relationship between their reasoning patterns and generated code quality remains poorly understood.
Method: We propose the first taxonomy of LRM reasoning behaviors, comprising four phases and 15 fine-grained action types, based on human-annotated reasoning traces. We conduct cross-model (e.g., Qwen3, DeepSeek-R1-7B, o3) and cross-task empirical analysis to characterize reasoning dynamics.
Contribution/Results: We identify systematic differences in reasoning paths: Qwen3 adopts iterative refinement, whereas DeepSeek-R1-7B follows a predominantly linear trajectory. Critical actions, including unit test generation and scaffolding construction, significantly improve functional correctness. Moreover, context-aware prompting effectively steers reasoning toward higher-quality paths. Our findings provide both theoretical insights into LRM reasoning mechanisms and practical guidance for prompt engineering and reliability enhancement in code generation.
📝 Abstract
Many large language models (LLMs) are currently used for software engineering tasks such as code generation. More advanced models, known as large reasoning models (LRMs) and exemplified by OpenAI's o3, DeepSeek R1, and Qwen3, have demonstrated the capability to perform multi-step reasoning. Despite these advances, little attention has been paid to systematically analyzing the reasoning patterns these models exhibit and how such patterns influence the quality of the generated code. This paper presents a comprehensive study investigating the reasoning behavior of LRMs during code generation. We prompted several state-of-the-art LRMs of varying sizes with code generation tasks and applied open coding to manually annotate their reasoning traces. From this analysis, we derive a taxonomy of LRM reasoning behaviors encompassing 15 reasoning actions across four phases.
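To make the annotation scheme concrete, here is a minimal sketch of how a labeled reasoning trace might be represented. The phase names and the `restate_requirements` action are our own placeholders (the abstract does not enumerate the four phases or all 15 actions); only scaffolding, unit test creation, and flaw detection are actions the paper itself names.

```python
from dataclasses import dataclass

@dataclass
class ReasoningAction:
    phase: str   # one of the taxonomy's four phases
    action: str  # one of its 15 fine-grained action types
    span: str    # excerpt of the reasoning trace this label covers

# Hypothetical annotation of one reasoning trace. Phase names and
# "restate_requirements" are placeholders; scaffolding, unit test
# creation, and flaw detection are actions named in the paper.
trace = [
    ReasoningAction("understanding", "restate_requirements",
                    "The task asks for a function that parses..."),
    ReasoningAction("planning", "scaffolding",
                    "I'll start from a skeleton: a parser, then..."),
    ReasoningAction("implementation", "unit_test_creation",
                    "Quick check: assert evaluate('1+2') == 3"),
    ReasoningAction("verification", "flaw_detection",
                    "Wait, this breaks on empty input..."),
]
```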
Our empirical study based on this taxonomy yields several findings. First, we identify common reasoning patterns, showing that LRMs generally follow a human-like coding workflow, with more complex tasks eliciting additional actions such as scaffolding, flaw detection, and style checks. Second, we compare reasoning across models, finding that Qwen3 reasons iteratively while DeepSeek-R1-7B follows a more linear, waterfall-like approach. Third, we analyze the relationship between reasoning and code correctness, showing that actions such as unit test creation and scaffold generation strongly support functional correctness, and that LRMs adapt their strategies to the task context. Finally, we evaluate lightweight prompting strategies informed by these findings, demonstrating the potential of context- and reasoning-oriented prompts to improve LRM-generated code. Our results offer insights and practical implications for advancing automatic code generation.
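The abstract does not reproduce the prompt templates that were evaluated, but a reasoning-oriented prompt in this spirit would steer the model toward the high-value actions identified above (scaffold generation and unit test creation). A minimal sketch, assuming the `openai` Python client and an o-series model; the template wording is illustrative, not the paper's:

```python
from openai import OpenAI

# Illustrative template, not the paper's actual prompt: it nudges
# the model toward scaffolding and unit test creation before coding.
REASONING_PROMPT = """Solve the following coding task.
Before writing the final solution:
1. Sketch a scaffold (function signatures, key data structures).
2. Write two or three unit tests the solution must pass.
3. Implement the solution and check it against your tests.

Task:
{task}
"""

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate(task: str, model: str = "o3-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": REASONING_PROMPT.format(task=task)}],
    )
    return response.choices[0].message.content
```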