🤖 AI Summary
This work addresses racial and gender bias in LLM-driven resume evaluation by proposing FAIRE, a fine-grained fairness benchmark tailored to hiring scenarios. Methodologically, it combines LLM-based semantic modeling, identity-aware prompt engineering, and contrastive bias measurement. FAIRE supports controllable perturbation of identity attributes, cross-model bias quantification, and two bias-detection paths (direct scoring and ranking) to uncover implicit biases in resume evaluation across multiple industries. Experiments show that every evaluated LLM exhibits some bias, with magnitude and direction varying considerably across models. The project open-sources the benchmark dataset, evaluation code, and evaluation protocols, establishing reproducible, standardized infrastructure for fairness assessment of AI-powered recruitment tools.
📝 Abstract
As AI-driven hiring transforms recruitment practices, concerns about fairness and bias have become increasingly important. To explore these issues, we introduce FAIRE (Fairness Assessment In Resume Evaluation), a benchmark that tests for racial and gender bias in large language models (LLMs) used to evaluate resumes across different industries. We use two methods, direct scoring and ranking, to measure how model performance changes when resumes are slightly altered to reflect different racial or gender identities. Our findings reveal that while every model exhibits some degree of bias, the magnitude and direction vary considerably. The benchmark provides a clear way to examine these differences, offers insight into the fairness of AI-based hiring tools, and highlights the urgent need for strategies to reduce bias in AI-driven recruitment. Our benchmark code and dataset are open-sourced at https://github.com/athenawen/FAIRE-Fairness-Assessment-In-Resume-Evaluation.git.
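To make the two measurement paths concrete, the following is a minimal Python sketch of how a score gap and a ranking preference could be computed over identity-perturbed resumes. It is an illustration under stated assumptions, not FAIRE's actual code or metrics: the `{NAME}` placeholder, the function names, the example names, and the toy scorer (a deliberately biased stand-in for an LLM call) are all hypothetical.

```python
from statistics import mean
from typing import Callable, List

# Hypothetical placeholder marking where an identity cue (here, a name) is inserted.
PLACEHOLDER = "{NAME}"


def swap_name(resume: str, name: str) -> str:
    """Return a copy of the resume with the name placeholder filled in."""
    return resume.replace(PLACEHOLDER, name)


def direct_score_gap(
    resumes: List[str],
    names_a: List[str],
    names_b: List[str],
    score: Callable[[str], float],
) -> float:
    """Direct scoring: mean score difference (group A minus group B)
    over otherwise identical resumes. Zero means no measured gap."""
    gaps = []
    for resume in resumes:
        score_a = mean(score(swap_name(resume, n)) for n in names_a)
        score_b = mean(score(swap_name(resume, n)) for n in names_b)
        gaps.append(score_a - score_b)
    return mean(gaps)


def rank_preference_rate(
    resumes: List[str],
    name_a: str,
    name_b: str,
    prefer_first: Callable[[str, str], bool],
) -> float:
    """Ranking: fraction of comparisons in which the name_a version is
    ranked above the name_b version. Each pair is shown in both orders
    to cancel position effects; 0.5 means no measured preference."""
    wins = 0
    total = 0
    for resume in resumes:
        a = swap_name(resume, name_a)
        b = swap_name(resume, name_b)
        wins += int(prefer_first(a, b))      # a shown first, a preferred
        wins += int(not prefer_first(b, a))  # b shown first, b not preferred
        total += 2
    return wins / total


def toy_score(text: str) -> float:
    """Stand-in for an LLM scorer; adds a bonus for one name on purpose
    so that the example produces a nonzero gap."""
    base = 5.0 + text.count("Python")
    return base + (1.0 if "Alex" in text else 0.0)


def toy_prefer_first(first: str, second: str) -> bool:
    """Stand-in for an LLM ranker: prefers the first resume unless the
    second one scores strictly higher."""
    return toy_score(first) >= toy_score(second)


if __name__ == "__main__":
    sample = [PLACEHOLDER + "\nSkills: Python, SQL", PLACEHOLDER + "\nSkills: Excel"]
    print(direct_score_gap(sample, ["Alex"], ["Jordan"], toy_score))
    print(rank_preference_rate(sample, "Alex", "Jordan", toy_prefer_first))
```

In practice the two toy functions would be replaced by calls to the model under test, and the name lists by the identity perturbations defined in the benchmark dataset. The sign of the score gap indicates the direction of the bias, which is what allows comparisons of both magnitude and direction across models.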