Understanding Chain-of-Thought Effectiveness in Code Generation: An Empirical and Information-Theoretic Analysis

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
The underlying mechanisms by which Chain-of-Thought (CoT) prompting enhances large language models’ (LLMs’) code generation performance remain poorly understood. Method: This study systematically investigates CoT’s efficacy through an information-theoretic lens—specifically, conditional mutual information (I(Y;C|X))—across a multi-scale model spectrum (7B–480B), six Python and twelve multilingual code-generation benchmarks, and complexity-stratified evaluation. Contribution/Results: We quantitatively establish that CoT effectiveness critically depends on programming language, model scale, and reasoning quality—not merely template structure. Structured CoT yields average Pass@1 improvements of 5–12% over zero-shot CoT, outperforming unstructured variants. Crucially, reasoning fidelity proves more decisive than syntactic formatting; low-quality CoT degrades performance. These findings provide empirically grounded, actionable guidelines for selecting optimal CoT strategies across model sizes and programming languages.

📝 Abstract
Large language models (LLMs) achieve strong performance on code generation, but the mechanisms by which Chain-of-Thought (CoT) prompting helps remain unclear. We present a systematic empirical and information-theoretic study of CoT effectiveness in neural code generation, evaluating five paradigms (Zero-Shot, Zero-Shot CoT, Self-Planning, Structured CoT, Reasoning-CoT) across six Python benchmarks, a multilingual benchmark with 12 programming languages, and six models from 7B to 480B parameters, using conditional mutual information $I(Y;C|X)$ as a conceptual lens. Our results show that externally guided CoT consistently outperforms direct generation, with structured methods improving Pass@1 by 5–12% on average while using substantially fewer tokens than reflective reasoning, and that CoT benefits depend on language type systems and model capacity. We further find that reasoning *quality* is critical: high-quality structured CoT from strong generators yields significantly higher accuracy than lightweight alternatives with the same template, whereas naive Zero-Shot CoT can even degrade performance. These findings provide practical guidance for choosing CoT strategies based on model capacity, language characteristics, and task complexity.
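The Pass@1 metric reported above comes from the standard unbiased pass@k estimator used across code-generation benchmarks: given n sampled completions of which c pass the unit tests, it computes the probability that at least one of k randomly chosen samples is correct. A minimal sketch (the function name `pass_at_k` is our own; the formula is the standard estimator, not this paper's contribution):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c correct), passes.

    Computed as 1 - C(n-c, k) / C(n, k), expanded as a product to
    avoid large binomial coefficients.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must include at least one correct sample.
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Example: 4 generations, 2 correct -> pass@1 is 0.5
print(pass_at_k(4, 2, 1))
```

Pass@1 is simply this estimator with k=1, i.e. the fraction of sampled completions that pass, averaged over problems.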
Problem

Research questions and friction points this paper is trying to address.

Analyzes how Chain-of-Thought prompting improves code generation in LLMs.
Evaluates CoT effectiveness across models, languages, and benchmarks empirically.
Investigates the role of reasoning quality and structured guidance in CoT.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically evaluate five CoT paradigms across multiple benchmarks
Use conditional mutual information as conceptual lens for analysis
Show structured CoT improves accuracy with fewer tokens than reflective reasoning
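The conditional-mutual-information lens I(Y;C|X) above asks how much the reasoning chain C tells us about the output code Y beyond what the prompt X already determines. For intuition, here is a plain plug-in estimator for discrete triples; the paper uses the quantity conceptually, and this discretized estimator and its variable encoding are our own illustrative assumptions:

```python
import math
from collections import Counter

def conditional_mutual_information(samples):
    """Plug-in estimate of I(Y;C|X) in bits from (x, y, c) triples.

    I(Y;C|X) = sum over (x,y,c) of p(x,y,c) * log2 of
               [p(x) * p(x,y,c)] / [p(x,y) * p(x,c)]
    With empirical counts, the 1/n factors cancel inside the log.
    """
    samples = list(samples)
    n = len(samples)
    n_xyc = Counter(samples)
    n_x = Counter(x for x, _, _ in samples)
    n_xy = Counter((x, y) for x, y, _ in samples)
    n_xc = Counter((x, c) for x, _, c in samples)
    mi = 0.0
    for (x, y, c), cnt in n_xyc.items():
        p = cnt / n
        mi += p * math.log2((cnt * n_x[x]) / (n_xy[(x, y)] * n_xc[(x, c)]))
    return mi
```

Under this lens, a reasoning chain that merely restates the prompt contributes nothing (I(Y;C|X) near zero), while a chain that genuinely narrows down the correct program has high conditional mutual information, which matches the paper's finding that reasoning quality, not template structure, drives CoT gains.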
Naizhu Jin
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Zhong Li
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Guang Yang
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China
Tian Zhang
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Qingkai Zeng
Assistant Professor, Nankai University; University of Notre Dame
data mining, natural language processing, knowledge graph, large language models