🤖 AI Summary
Current evaluations of bias in code generation are largely confined to simple conditional statements, failing to capture the subtle biases present in real-world programming contexts. This work proposes a systematic evaluation framework grounded in machine learning pipelines, with a specific focus on the introduction of sensitive attributes during the feature selection stage. By testing both code-specific and general-purpose large language models across diverse prompts and complexity levels, the study reveals, for the first time, that existing assessment methods substantially underestimate real-world bias risks: 87.7% of generated ML pipelines incorporate sensitive attributes, markedly higher than the 59.2% detected using conditional-statement-based tests. This discrepancy persists across multiple bias mitigation strategies, thereby challenging the validity of current bias evaluation paradigms.
📝 Abstract
Prior work evaluates code generation bias primarily through simple conditional statements, which represent only a narrow slice of real-world programming and reveal solely overt, explicitly encoded bias. We demonstrate that this approach dramatically underestimates bias in practice by examining a more realistic task: generating machine learning (ML) pipelines. Testing both code-specialized and general-instruction large language models, we find that generated pipelines exhibit significant bias during feature selection. Sensitive attributes appear in 87.7% of cases on average, despite models demonstrably excluding irrelevant features (e.g., including "race" while dropping "favorite color" for credit scoring). This bias is substantially more prevalent than that captured by conditional statements, where sensitive attributes appear in only 59.2% of cases. These findings are robust across prompt mitigation strategies, varying numbers of attributes, and different pipeline difficulty levels. Our results challenge simple conditionals as valid proxies for bias evaluation and suggest current benchmarks underestimate bias risk in practical deployments.
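The headline metric, the share of generated pipelines whose feature selection includes a sensitive attribute, can be illustrated with a minimal sketch. This is not the paper's evaluation code: the sensitive-attribute list, the function names, and the example feature lists are all hypothetical, and the sketch assumes that the selected features of each generated pipeline have already been extracted as a list of column names.

```python
from typing import Iterable, List, Set

# Hypothetical set of sensitive attributes; the paper's actual list is not given here.
SENSITIVE: Set[str] = {"race", "gender", "age"}


def uses_sensitive(selected: Iterable[str], sensitive: Set[str] = SENSITIVE) -> bool:
    """Return True if any selected feature name is a sensitive attribute."""
    normalized = {name.strip().lower() for name in selected}
    return bool(normalized & sensitive)


def inclusion_rate(pipelines: Iterable[List[str]]) -> float:
    """Fraction of pipelines that select at least one sensitive attribute."""
    pipelines = list(pipelines)
    if not pipelines:
        return 0.0
    hits = sum(uses_sensitive(features) for features in pipelines)
    return hits / len(pipelines)


# Made-up feature selections for a credit-scoring task (illustration only).
example = [
    ["income", "race", "credit_history"],
    ["income", "credit_history"],
    ["Gender", "income"],
    ["income", "employment_length"],
]
rate = inclusion_rate(example)
```

Under these assumptions, a pipeline counts toward the rate as soon as one sensitive column survives feature selection, regardless of which irrelevant columns (such as "favorite color") were dropped.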