The Fools are Certain; the Wise are Doubtful: Exploring LLM Confidence in Code Completion

📅 2025-08-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work systematically investigates confidence modeling for large language models (LLMs) in code completion tasks. We propose an intrinsic confidence estimation framework grounded in perplexity, entropy, and mutual information. Our empirical analysis spans six mainstream programming languages, five open-source LLMs, and 1,008 real-world GitHub project code samples—enabling cross-language, cross-model, and cross-dataset evaluation. Key findings include: (1) strongly typed languages (e.g., Java) significantly reduce model perplexity, whereas weakly typed/scripting languages (e.g., Perl) yield lower confidence; (2) architectural differences among LLMs exert a substantially greater influence on confidence than dataset provenance; and (3) code comments provide only marginal confidence improvement. To our knowledge, this is the first study to quantitatively characterize such cross-dimensional confidence patterns in code generation. The results establish a reproducible, quantitative benchmark and practical guidelines for assessing the reliability of LLM-generated code.

📝 Abstract
Code completion is the task of providing missing tokens given the surrounding context. It can boost developer productivity while serving as a powerful code discovery tool. Following the Large Language Model (LLM) wave, code completion has been approached with diverse LLMs fine-tuned on code (code LLMs). The performance of code LLMs can be assessed with downstream and intrinsic metrics. Downstream metrics are usually employed to evaluate the practical utility of a model, but can be unreliable and require complex calculations and domain-specific knowledge. In contrast, intrinsic metrics such as perplexity, entropy, and mutual information, which measure model confidence or uncertainty, are simple, versatile, and universal across LLMs and tasks, and can serve as proxies for functional correctness and hallucination risk in LLM-generated code. Motivated by this, we evaluate the confidence of LLMs when generating code by measuring code perplexity across programming languages, models, and datasets using various LLMs and a sample of 1,008 files from 657 GitHub projects. We find that strongly typed languages exhibit lower perplexity than dynamically typed languages. Scripting languages also demonstrate higher perplexity. Perl appears universally high in perplexity, whereas Java appears low. Code perplexity depends on the employed LLM, but not on the code dataset. Although code comments often increase perplexity, the language ranking based on perplexity is barely affected by their presence. LLM researchers, developers, and users can employ our findings to assess the benefits and suitability of LLM-based code completion in specific software projects based on how language, model choice, and code characteristics impact model confidence.
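The intrinsic metrics the paper relies on are straightforward to compute from a model's token-level probabilities. The sketch below, a minimal illustration rather than the authors' implementation, shows perplexity (exponentiated mean negative log-likelihood) and mean predictive entropy; the function names and toy inputs are hypothetical, and in practice the per-token log-probabilities would come from a code LLM's output:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over a token sequence.

    Lower values mean the model is more confident in the generated code.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

def mean_entropy(prob_dists):
    """Mean Shannon entropy (in nats) of per-token predictive distributions.

    Each element of prob_dists is the model's probability distribution
    over the vocabulary at one generation step.
    """
    per_token = [-sum(p * math.log(p) for p in dist if p > 0)
                 for dist in prob_dists]
    return sum(per_token) / len(per_token)

# Toy example: hypothetical log-probabilities for a 4-token completion.
logprobs = [math.log(0.9), math.log(0.5), math.log(0.8), math.log(0.7)]
print(f"perplexity: {perplexity(logprobs):.3f}")
```

A sequence where every token has probability 0.5 yields a perplexity of exactly 2; confident completions (probabilities near 1) drive it toward 1, which is why strongly typed languages with more predictable token sequences score lower.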
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM confidence in code completion tasks
Assessing perplexity across programming languages and models
Understanding how code characteristics impact model uncertainty
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating code perplexity across languages and models
Using intrinsic metrics as proxies for correctness
Assessing LLM confidence in code completion tasks