Evaluating Non-English Developer Support in Machine Learning for Software Engineering

📅 2026-05-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limited support of current large code models and their evaluation methodologies for non-English natural language elements, such as code comments. The authors systematically evaluate five prominent models—including CodeGemma and CodeLlama—on comment generation across Dutch, English, Greek, Polish, and Chinese. They introduce the first multilingual code comment dataset comprising 12,500 human-annotated samples and propose a fine-grained error taxonomy encompassing 26 error categories. Their findings reveal a substantial degradation in comment quality for non-English languages, with linguistic errors increasing by up to 15.1×. Moreover, existing automatic evaluation methods, including neural metrics and LLM-as-a-judge approaches, prove unreliable in detecting linguistic and semantic inaccuracies, underscoring the irreplaceable role of human judgment in evaluating multilingual code generation.
📝 Abstract
Large Language Models are increasingly used in software engineering, but both code generation and its evaluation remain predominantly English-centric. This leaves a major gap in our understanding of how well current tools support multilingual development, where code contains non-English natural language. In this paper, we investigate non-English code comment generation and the reliability of current methods for evaluating such outputs. We evaluate five code LLMs (CodeGemma, CodeLlama, CodeQwen1.5, GraniteCode, and StarCoder2) across five natural languages: Dutch, English, Greek, Polish, and Chinese. We further conduct an open-coding study of 12,500 generated comments, from which we derive a publicly released human-annotated dataset and a taxonomy of 26 error types. We use these human annotations to evaluate the performance of neural metrics and LLM-as-a-judge pipelines. Our findings show that generative performance deteriorates substantially outside English, with linguistic errors increasing by up to 15.1×, alongside frequent incoherent generations and a rise in semantic errors. More critically, we show that automatic detection of errors in non-English comments is unreliable. Across classical overlap-based metrics, off-the-shelf neural metrics, extended neural metrics using newer multilingual, language-specific, and code-specific models, and LLM-as-a-judge pipelines, no automatic approach provides reliable and consistent assessment. Neural metrics fail to distinguish correct comments from incorrect outputs or even random noise, and tend to overestimate quality in non-English settings. LLM-as-a-judge methods achieve the highest agreement with human annotations but fail to reliably capture important language-related and semantic errors. Overall, our results show that both evaluation and generation are key barriers for multilingual tooling, and that human judgment remains indispensable.
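As a rough illustration of what "agreement with human annotations" can mean in practice, the sketch below computes Cohen's kappa between human labels and the labels of an automatic judge. The abstract does not say which agreement statistic the authors use, so the choice of kappa is an assumption here, and the function name `cohen_kappa` and the label lists are hypothetical examples, not data from the paper.

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(human)
    # Fraction of items on which both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six generated comments (not taken from the dataset).
human = ["ok", "error", "error", "ok", "error", "ok"]
judge = ["ok", "ok", "error", "ok", "ok", "ok"]

kappa = cohen_kappa(human, judge)
print(f"kappa = {kappa:.3f}")
```

In this made-up example the judge misses two of the three errors the human annotators found, so raw agreement is 4 out of 6 while kappa is noticeably lower. That gap is one reason a chance-corrected statistic can be more informative than plain accuracy when a judge tends to over-accept outputs.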
Problem

Research questions and friction points this paper is trying to address.

multilingual code generation
non-English developer support
code comment evaluation
LLM evaluation
software engineering
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual code generation
non-English code comments
human-annotated error taxonomy
evaluation reliability
LLM-as-a-judge
Jonathan Katzy
Software Engineering Research Group, Delft University of Technology, The Netherlands
Yongcheng Huang
Software Engineering Research Group, Delft University of Technology, The Netherlands
Gopal-Raj Panchu
Software Engineering Research Group, Delft University of Technology, The Netherlands
Maksym Ziemlewski
Software Engineering Research Group, Delft University of Technology, The Netherlands
Paris Loizides
Software Engineering Research Group, Delft University of Technology, The Netherlands
Sander Vermeulen
Software Engineering Research Group, Delft University of Technology, The Netherlands
Arie van Deursen
Professor of Software Engineering, Delft University of Technology
Software engineering, software testing, empirical software engineering, domain-specific languages, artificial intelligence
Maliheh Izadi
Assistant Professor @ Delft University of Technology, The Netherlands
Software engineering, Evaluation, AI4SE, LLM4Code, Agents