🤖 AI Summary
This study investigates how gendered linguistic cues in prompts influence fairness in large language models (LLMs) during code generation and review. Employing a mixed-methods approach—including analysis of real-world prompt corpora, controlled developer experiments, and an LLM-simulated code review framework—the work reveals, for the first time, a systematic bias in LLMs during the code review phase based on the gendered style of the prompt. Although female-styled prompts are more indirect, they are functionally equivalent to male-styled ones; nevertheless, LLMs consistently show a higher propensity to approve code generated from such prompts, even when code quality is comparable. These findings indicate that fairness risks arise primarily during the review stage rather than during generation, underscoring the critical role of linguistic style in perpetuating implicit biases within AI-assisted programming environments.
📝 Abstract
LLMs are increasingly embedded in programming workflows, from code generation to automated code review. Yet how gendered communication styles interact with LLM-assisted programming and code review remains underexplored. We present a mixed-methods pilot study examining whether gender-related linguistic differences in prompts influence code generation outcomes and code review decisions. Across three complementary studies, we analyze (i) collected real-world coding prompts, (ii) a controlled user study in which developers solve identical programming tasks with LLM assistance, and (iii) an LLM-based simulated evaluation framework that systematically varies gender-coded prompt styles and reviewer personas. We find that gender-related differences in prompting style are subtle but measurable, with female-authored prompts exhibiting more indirect and involved language; these differences do not translate into consistent gaps in functional correctness or static code quality. For LLM code review, in contrast, we observe systematic biases: on average, models approve female-authored code more often, despite comparable quality. Controlled experiments show that gender-coded prompt styles affect code length and maintainability, while reviewer behavior varies across models. Our findings suggest that fairness risks in LLM-assisted programming arise less from generation accuracy than from LLM evaluation, as LLMs are increasingly deployed as automated code reviewers.
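The third study (iii) crosses gender-coded prompt styles with reviewer personas and records each reviewer verdict. A minimal sketch of such a factorial setup is shown below; all names (`query_reviewer`, the style and persona labels) are hypothetical illustrations, not the paper's actual framework, and the LLM call is stubbed out.

```python
import itertools

# Hypothetical factor levels; the paper's actual conditions may differ.
PROMPT_STYLES = ["female-coded", "male-coded", "neutral"]
REVIEWER_PERSONAS = ["female reviewer", "male reviewer", "no persona"]

def query_reviewer(code: str, persona: str) -> bool:
    """Stand-in for an LLM call returning an approve/reject verdict.

    A real implementation would prompt a reviewer model with the code
    and the persona instruction; here it is stubbed deterministically.
    """
    return True

def run_grid(code_by_style: dict) -> dict:
    """Cross every prompt style with every reviewer persona (3 x 3 cells)."""
    return {
        (style, persona): query_reviewer(code_by_style[style], persona)
        for style, persona in itertools.product(PROMPT_STYLES, REVIEWER_PERSONAS)
    }

# Example: the same trivial snippet assigned to every style condition.
verdicts = run_grid({s: "def add(a, b): return a + b" for s in PROMPT_STYLES})
approval_rate = sum(verdicts.values()) / len(verdicts)
```

Bias would then be read off as a difference in `approval_rate` across prompt-style conditions at matched code quality.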