🤖 AI Summary
This paper identifies a significant language-dependency bias in large language models (LLMs) for code generation: the correctness and runtime efficiency of generated code vary substantially when the same programming task is described in English versus Chinese.
Method: To evaluate this phenomenon systematically, the authors build a unified evaluation framework around a curated dataset of 52 Python programming tasks with parallel, aligned English and Chinese task descriptions. The framework integrates abstract syntax tree (AST) parsing and symbolic execution for functional correctness verification, and quantifies efficiency through runtime complexity estimation. Eight popular LLM-based code generation models (LCGMs), together with GPT-3.5-Turbo and GPT-4, are evaluated under identical conditions.
Contribution/Results: Across the evaluated models, the generated code differs in correctness between the English and Chinese descriptions on an average of 12% of the tasks, and 39% of those tasks also differ in efficiency. This work presents the first empirical study of how the natural language of a task description affects code generation quality, and provides a benchmark and methodology for fairness-aware evaluation and multilingual code-generation research.
📝 Abstract
Large Language Models (LLMs) have demonstrated promising capabilities for code generation. While existing benchmarks evaluate the correctness and efficiency of LLM-generated code, the potential linguistic bias, where code quality varies with the natural language used to describe a programming task, remains underexplored. In this paper, we investigate this linguistic bias through the lens of English and Chinese. To facilitate our investigation, we present a unified evaluation framework comprising a curated dataset of 52 Python programming questions with parallel bilingual task descriptions, automated correctness verification, and efficiency quantification tools based on runtime complexity estimation. Based on this framework, we conduct the first empirical study of linguistic bias in LLM-generated code on eight popular LLM-based code generation models (LCGMs), as well as GPT-3.5-Turbo and GPT-4. We observe that the code generated by these LCGMs differs in correctness on an average of 12% of the bilingual programming tasks, and that 39% of those tasks also exhibit differing efficiency. Our findings indicate that LLMs commonly exhibit linguistic bias in code generation.
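The two reported percentages are nested: the second is measured only over tasks where correctness already differs between languages. The sketch below illustrates that reading with made-up data. The `TaskComparison` record, its field names, and the `bias_rates` helper are hypothetical and are not taken from the paper's framework, which may define and aggregate these quantities differently.

```python
from dataclasses import dataclass


@dataclass
class TaskComparison:
    """Hypothetical per-task outcome for one model (not the paper's schema)."""
    task_id: str
    correctness_differs: bool  # English vs. Chinese prompt yield different correctness
    efficiency_differs: bool   # estimated runtime complexity differs between the two


def bias_rates(comparisons):
    """Return (share of tasks whose correctness differs,
    share of THOSE tasks whose efficiency also differs)."""
    if not comparisons:
        return 0.0, 0.0
    differing = [c for c in comparisons if c.correctness_differs]
    correctness_rate = len(differing) / len(comparisons)
    if not differing:
        return correctness_rate, 0.0
    also_efficiency = [c for c in differing if c.efficiency_differs]
    return correctness_rate, len(also_efficiency) / len(differing)


# Toy data, invented purely for illustration.
sample = [
    TaskComparison("t1", False, False),
    TaskComparison("t2", True, True),
    TaskComparison("t3", False, True),   # ignored by the second rate
    TaskComparison("t4", True, False),
    TaskComparison("t5", False, False),
]

correctness_rate, efficiency_rate = bias_rates(sample)
print(f"correctness differs on {correctness_rate:.0%} of tasks")
print(f"of those, efficiency also differs on {efficiency_rate:.0%}")
```

Under this reading, the per-model rates would then be averaged across the evaluated models to obtain figures comparable to the 12% and 39% above.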