🤖 AI Summary
This study investigates whether large language models (LLMs) exhibit a human-like Dunning-Kruger effect (DKE) in programming tasks—specifically, whether lower-capability models overestimate their code generation proficiency in low-resource programming languages.
Method: We employ a multilingual evaluation framework to systematically compare model output confidence against actual execution accuracy across 12 programming languages—including mainstream ones (e.g., Python, Rust) and low-resource ones (e.g., Haskell, Elixir).
Contribution/Results: We find a strong negative correlation between model capability and overconfidence, with overconfidence markedly amplified in low-resource languages: the weakest model exhibits a 47.3% confidence–accuracy gap in Haskell, versus only 12.1% in Python. This reveals a previously uncharacterized dimension of LLM “metacognitive deficiency,” providing empirical grounding for improving self-calibration mechanisms and designing more trustworthy AI systems.
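The confidence–accuracy gap described above can be sketched as a simple calibration measure: mean stated confidence minus execution accuracy, computed per language. This is a minimal illustration, not the paper's actual framework; the helper function and the sample records below are hypothetical.

```python
from statistics import mean

def confidence_accuracy_gap(results):
    """Compute mean stated confidence minus execution accuracy.

    results: list of (confidence in [0, 1], passed: bool) pairs,
    one per generated program. A positive gap indicates overconfidence.
    """
    avg_confidence = mean(conf for conf, _ in results)
    accuracy = mean(1.0 if passed else 0.0 for _, passed in results)
    return avg_confidence - accuracy

# Hypothetical per-language records: (model-stated confidence, tests passed)
haskell_runs = [(0.90, False), (0.80, False), (0.85, True), (0.95, False)]
python_runs = [(0.90, True), (0.80, True), (0.85, True), (0.70, False)]

gap_haskell = confidence_accuracy_gap(haskell_runs)  # high confidence, low accuracy
gap_python = confidence_accuracy_gap(python_runs)    # confidence roughly tracks accuracy
```

A DKE-like pattern would show as a much larger gap for the low-resource language than for the mainstream one, as in the sample records here.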
📝 Abstract
As artificial intelligence systems increasingly collaborate with humans in creative and technical domains, questions arise about the cognitive boundaries and biases that shape this shared agency. This paper investigates the Dunning-Kruger Effect (DKE), the tendency of those with limited competence to overestimate their abilities, in state-of-the-art LLMs on coding tasks. By analyzing model confidence and performance across a diverse set of programming languages, we reveal that AI models mirror human patterns of overconfidence, especially in unfamiliar or low-resource domains. Our experiments demonstrate that less competent models, and those operating in rare programming languages, exhibit stronger DKE-like bias, suggesting that the strength of the bias is inversely related to model competence.