🤖 AI Summary
The underlying mechanisms by which Chain-of-Thought (CoT) prompting enhances large language models’ (LLMs’) code generation performance remain poorly understood.
Method: This study systematically investigates CoT's efficacy through an information-theoretic lens (conditional mutual information, $I(Y;C|X)$) across six models spanning 7B to 480B parameters, six Python code-generation benchmarks, a multilingual benchmark covering twelve programming languages, and complexity-stratified evaluation.
Contribution/Results: We quantitatively establish that CoT effectiveness depends critically on programming language, model scale, and reasoning quality, not merely template structure. Structured CoT yields average Pass@1 improvements of 5–12% over direct zero-shot generation, outperforming unstructured variants. Crucially, reasoning fidelity proves more decisive than syntactic formatting: low-quality CoT can even degrade performance. These findings provide empirically grounded, actionable guidelines for selecting CoT strategies across model sizes, programming languages, and task complexities.
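For readers unfamiliar with the metric: Pass@1 reported above is typically computed with the standard unbiased pass@k estimator popularized by the Codex evaluation (the summary does not specify the paper's exact protocol, so this is an illustrative sketch, not the authors' code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, passes the tests. Pass@1 is the k=1 special case."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must include at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `n = 10` generations and `c = 5` correct, `pass_at_k(10, 5, 1)` reduces to the plain fraction correct, 0.5, which is what a "Pass@1 improvement of 5–12%" is measured against.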
📝 Abstract
Large language models (LLMs) achieve strong performance on code generation, but the mechanisms by which Chain-of-Thought (CoT) prompting helps remain unclear. We present a systematic empirical and information-theoretic study of CoT effectiveness in neural code generation, evaluating five paradigms (Zero-Shot, Zero-Shot CoT, Self-Planning, Structured CoT, Reasoning-CoT) across six Python benchmarks, a multilingual benchmark with 12 programming languages, and six models from 7B to 480B parameters, using conditional mutual information $I(Y;C|X)$ as a conceptual lens. Our results show that externally guided CoT consistently outperforms direct generation, with structured methods improving Pass@1 by 5--12% on average while using substantially fewer tokens than reflective reasoning, and that CoT benefits depend on language type systems and model capacity. We further find that reasoning \emph{quality} is critical: high-quality structured CoT from strong generators yields significantly higher accuracy than lightweight alternatives with the same template, whereas naive Zero-Shot CoT can even degrade performance. These findings provide practical guidance for choosing CoT strategies based on model capacity, language characteristics, and task complexity.
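To make the conceptual lens concrete: $I(Y;C|X)$ measures how much the CoT trace $C$ tells you about the output $Y$ once the prompt $X$ is known. For discrete variables it can be estimated with a simple plug-in estimator over empirical counts; the sketch below is illustrative only and is not claimed to be the estimator used in the paper:

```python
from collections import Counter
from math import log2

def conditional_mutual_information(samples):
    """Plug-in estimate of I(Y; C | X) in bits from (x, y, c) triples.

    Uses the identity
        I(Y;C|X) = sum_{x,y,c} p(x,y,c) log [ p(x,y,c) p(x) / (p(x,y) p(x,c)) ].
    All counts are normalized by the same n, so the ratio of counts
    equals the ratio of probabilities.
    """
    n = len(samples)
    joint = Counter(samples)                       # counts of (x, y, c)
    cx = Counter(x for x, _, _ in samples)         # counts of x
    cxy = Counter((x, y) for x, y, _ in samples)   # counts of (x, y)
    cxc = Counter((x, c) for x, _, c in samples)   # counts of (x, c)

    mi = 0.0
    for (x, y, c), cnt in joint.items():
        mi += (cnt / n) * log2(cnt * cx[x] / (cxy[(x, y)] * cxc[(x, c)]))
    return mi
```

If the trace fully determines the output (e.g. $C = Y$ for a fixed prompt), the estimate equals the conditional entropy of $Y$; if $C$ is independent of $Y$ given $X$, it is zero, which is the regime where CoT adds no information.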