🤖 AI Summary
This study addresses the underexplored issue of gender bias in AI-powered code generation tools (CGTs). Using a mixed-subjects experimental design, we systematically investigate how gender moderates CGT usage across three dimensions: task performance (completion time, code correctness), subjective cognitive load (measured via the NASA-TLX scale), and fine-grained interaction behaviors (captured via screen recording and behavioral log analysis). As this is a study proposal, no results are reported yet; we hypothesize that gender differences may emerge, for example in cognitive load during complex programming tasks, in tool-reliance patterns, and in the efficiency of specific interaction pathways. To our knowledge, this is the first empirical study to rigorously examine fairness and inclusivity in CGT usage from a gender perspective. The anticipated findings would provide data-driven insights for improving prompt engineering, feedback mechanisms, and UI/UX adaptation, ultimately advancing equitable, human-AI collaborative software development practices.
📝 Abstract
**Context:** The increasing reliance on Code Generation Tools (CGTs), such as Windsurf and GitHub Copilot, is revamping programming workflows and raising critical questions about fairness and inclusivity. While CGTs offer potential productivity enhancements, their effectiveness across diverse user groups has not been sufficiently investigated.
**Objectives:** We hypothesize that developers' interactions with CGTs vary by gender, influencing task outcomes and cognitive load, as prior research suggests that gender differences can affect technology use and cognitive processing.
**Methods:** The study will employ a mixed-subjects design with 54 participants, evenly divided by gender. Participants will complete two programming tasks (medium to hard difficulty), one with only CGT assistance and one with only internet access. Task orders and conditions will be counterbalanced to mitigate order effects. Data collection will include cognitive-load surveys, screen recordings, and task-performance metrics such as completion time, code correctness, and CGT interaction behaviors. Statistical analyses will be conducted to identify statistically significant differences in CGT usage.
**Expected Contributions:** Our work can uncover gender differences in CGT interaction and performance among developers. Our findings can inform future CGT designs and help address usability issues and potential disparities in interaction patterns across diverse user groups.
**Conclusion:** While results are not yet available, our proposal lays the groundwork for advancing fairness, accountability, transparency, and ethics (FATE) in CGT design. The outcomes are anticipated to contribute to inclusive AI practices and equitable tool development for all users.
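For illustration, the counterbalancing scheme described in Methods (two tasks, two conditions, orders crossed to mitigate order effects) could be sketched as follows. The task labels, group construction, and cyclic assignment here are hypothetical assumptions for exposition, not the study's actual protocol.

```python
from itertools import product

# Illustrative labels (assumed; the study's real task names are not specified).
TASKS = ("TaskA", "TaskB")
CONDITIONS = ("CGT-only", "Internet-only")


def counterbalanced_groups():
    """Build all 4 (task order x condition order) pairings.

    Each group is a per-participant plan: a tuple of
    (task, condition) steps performed in sequence.
    """
    groups = []
    for task_order, cond_order in product(
        [TASKS, TASKS[::-1]], [CONDITIONS, CONDITIONS[::-1]]
    ):
        groups.append(tuple(zip(task_order, cond_order)))
    return groups


def assign(participants_by_gender):
    """Cycle participants through the 4 groups within each gender stratum,
    so counterbalancing is balanced per gender (with 27 per gender, group
    sizes within a stratum differ by at most one)."""
    groups = counterbalanced_groups()
    assignment = {}
    for _, participants in participants_by_gender.items():
        for i, pid in enumerate(participants):
            assignment[pid] = groups[i % len(groups)]
    return assignment
```

Stratifying the rotation by gender keeps the order/condition pairings evenly spread across both groups, so order effects are not confounded with the gender comparison of interest.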