🤖 AI Summary
This study addresses a critical gap at the intersection of software engineering and AI ethics by systematically investigating implicit social biases—specifically gender, age, and racial biases—in code generation by large language models (LLMs).
Method: We propose the first bias-testing framework tailored to code generation, encompassing bias benchmark construction, multi-dimensional bias injection and detection, and a comparative experimental design involving five prompting strategies: zero-shot, one-shot, few-shot, chain-of-thought (CoT), and CoT with feedback.
Contribution/Results: Empirical evaluation across five mainstream LLMs reveals substantial bias in generated code (gender bias rates ranging from 13.47% to 49.10%). A key finding is that execution-level correction driven by feedback significantly outperforms prompt engineering alone—reducing GPT-4’s gender bias rate from 59.88% to 4.79%. Our work establishes a reproducible evaluation paradigm and actionable mitigation pathways for trustworthy code generation.
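The "bias injection and detection" step can be illustrated with a minimal, hypothetical execution-level check: run a candidate generated function on paired inputs that differ only in a protected attribute and flag divergent outputs as biased behavior. The function names and scoring logic below are illustrative assumptions, not the paper's actual benchmark.

```python
# Minimal sketch of an execution-level bias check (illustrative only):
# a generated scoring function is run on paired inputs that differ only
# in a protected attribute; any output difference is flagged as bias.

def generated_score(applicant: dict) -> float:
    """Stand-in for LLM-generated code; the 'gender' branch is the
    kind of defect an execution-level test should catch."""
    score = applicant["experience"] * 10.0
    if applicant["gender"] == "female":   # biased behavior
        score -= 5.0
    return score

def has_attribute_bias(fn, base: dict, attr: str, values: list) -> bool:
    """Return True if varying only `attr` changes the function's output."""
    outputs = set()
    for v in values:
        probe = {**base, attr: v}
        outputs.add(fn(probe))
    return len(outputs) > 1

base = {"experience": 3, "gender": "male"}
print(has_attribute_bias(generated_score, base, "gender", ["male", "female"]))
# → True: the generated code treats genders differently
```

A fair implementation that ignores the protected attribute would yield identical outputs across the pair and pass the check.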
📝 Abstract
As the adoption of LLMs becomes more widespread in software coding ecosystems, a pressing issue has emerged: does the generated code contain social bias and unfairness, such as bias related to age, gender, and race? This issue concerns the integrity, fairness, and ethical foundation of software applications that depend on the code generated by these models, but it remains underexplored in the literature. This paper presents a novel bias testing framework specifically designed for code generation tasks. Based on this framework, we conduct an extensive empirical study of the biases in code generated by five widely studied LLMs (i.e., PALM-2-CodeChat-bison, Claude-instant-1, GPT-3.5-turbo, GPT-4-turbo, and GPT-4). Our findings reveal that biases are prevalent: for example, 13.47% to 49.10% of the code samples generated by these LLMs exhibit biased behavior towards gender. Moreover, we study five bias mitigation prompting strategies commonly used in current code generation scenarios, i.e., zero-shot, one-shot, few-shot, and two Chain-of-Thought (CoT) prompts (with and without feedback-driven refinement). Our evaluation results illustrate that direct prompt engineering strategies have limited effectiveness in mitigating bias, but test execution feedback can reduce the ratio of biased code to a large extent (e.g., from 59.88% to 4.79% for GPT-4).
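The feedback-driven refinement described in the abstract can be sketched as a simple loop: generate code, run execution-level bias tests, and, if they fail, feed the failure report back into the next prompt. Everything below is a hedged illustration; `fake_llm`, `bias_report`, and `refine` are hypothetical stand-ins, not the paper's implementation or any real model API.

```python
# Hedged sketch of feedback-driven refinement. `fake_llm` stands in for a
# real model: it returns biased code on the first call and fair code once
# the prompt contains bias-test feedback. All names are assumptions.

def fake_llm(prompt: str) -> str:
    if "bias report" in prompt:
        # Model "corrects" itself after seeing test failures.
        return "def score(a):\n    return a['experience'] * 10.0\n"
    return ("def score(a):\n"
            "    s = a['experience'] * 10.0\n"
            "    if a['gender'] == 'female':\n"
            "        s -= 5.0\n"
            "    return s\n")

def bias_report(src: str) -> str:
    """Execute the candidate on inputs differing only in gender;
    return an empty string if outputs match, else a failure report."""
    ns = {}
    exec(src, ns)
    male = ns["score"]({"experience": 3, "gender": "male"})
    female = ns["score"]({"experience": 3, "gender": "female"})
    return "" if male == female else f"bias report: {male} vs {female}"

def refine(task: str, max_rounds: int = 3) -> str:
    prompt = task
    for _ in range(max_rounds):
        code = fake_llm(prompt)
        report = bias_report(code)
        if not report:
            return code                    # passed the bias tests
        prompt = task + "\n" + report      # feed failures back to the model
    return code

final = refine("Write score(applicant).")
print(bias_report(final) == "")  # → True after one feedback round
```

The key design point mirrored here is that the corrective signal comes from executing the generated code, not from rewording the prompt alone, which is the distinction the abstract draws between prompt engineering and test execution feedback.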