🤖 AI Summary
This study investigates whether large language models (LLMs) exhibit a human-like Dunning-Kruger effect (DKE) in programming tasks—specifically, whether lower-capability models overestimate their code generation proficiency in low-resource programming languages.
Method: We employ a multilingual evaluation framework to systematically compare model output confidence against actual execution accuracy across 12 programming languages—including mainstream ones (e.g., Python, Rust) and low-resource ones (e.g., Haskell, Elixir).
Contribution/Results: We find a strong negative correlation between model capability and overconfidence, with overconfidence markedly amplified in low-resource languages: the weakest model exhibits a 47.3% confidence–accuracy gap in Haskell, versus only 12.1% in Python. This reveals a previously uncharacterized dimension of LLM “metacognitive deficiency,” providing empirical grounding for improving self-calibration mechanisms and designing more trustworthy AI systems.
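The confidence–accuracy gap described above can be sketched as a simple calibration measure: mean stated confidence minus execution accuracy, computed per language. This is a minimal illustration, not the paper's actual framework; the helper function and the sample records below are hypothetical.

```python
from statistics import mean

def confidence_accuracy_gap(results):
    """Compute mean stated confidence minus execution accuracy.

    results: list of (confidence in [0, 1], passed: bool) pairs,
    one per generated program. A positive gap indicates overconfidence.
    """
    avg_confidence = mean(conf for conf, _ in results)
    accuracy = mean(1.0 if passed else 0.0 for _, passed in results)
    return avg_confidence - accuracy

# Hypothetical per-language records: (model-stated confidence, tests passed)
haskell_runs = [(0.90, False), (0.80, False), (0.85, True), (0.95, False)]
python_runs = [(0.90, True), (0.80, True), (0.85, True), (0.70, False)]

gap_haskell = confidence_accuracy_gap(haskell_runs)  # high confidence, low accuracy
gap_python = confidence_accuracy_gap(python_runs)    # confidence roughly tracks accuracy
```

A DKE-like pattern would show as a much larger gap for the low-resource language than for the mainstream one, as in the sample records here.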
📝 Abstract
As artificial intelligence systems increasingly collaborate with humans in creative and technical domains, questions arise about the cognitive boundaries and biases that shape this shared agency. This paper investigates the Dunning-Kruger Effect (DKE), the tendency of those with limited competence to overestimate their abilities, in state-of-the-art LLMs on coding tasks. By analyzing model confidence and performance across a diverse set of programming languages, we reveal that AI models mirror human patterns of overconfidence, especially in unfamiliar or low-resource domains. Our experiments demonstrate that less competent models, and those operating in rare programming languages, exhibit stronger DKE-like bias, suggesting that the strength of the bias is inversely related to model competence.