Who Does the Giant Number Pile Like Best: Analyzing Fairness in Hiring Contexts

📅 2025-01-08
🤖 AI Summary
This study systematically investigates the fairness of large language models (LLMs) in two recruitment tasks: resume summarization and retrieval. Prior work predominantly examines discriminative tasks and overlooks bias in generative settings; to address this gap, the authors construct a controllable synthetic resume dataset and introduce a demographic perturbation analysis that quantifies racial and gender bias in generative recruitment. Experiments reveal that approximately 10% of LLM-generated summaries exhibit race-based differences, while about 1% exhibit gender-based differences. All evaluated retrieval models display non-uniform candidate selection patterns and are comparably sensitive to demographic and non-demographic perturbations, suggesting that the observed fairness vulnerabilities stem at least in part from general model instability rather than from demographic-specific artifacts alone. These findings provide an empirical baseline for risk assessment and fairness-aware optimization in LLM-based recruitment systems.
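The perturbation analysis can be pictured with a short sketch. Everything below is a hypothetical stand-in rather than the paper's code: the `summarize` callable, the name-swap table, and the divergence score are all illustrative assumptions. The idea is to hold a resume's qualifications fixed, swap only demographic signals such as names, and measure how much the generated summary changes.

```python
# Minimal sketch of demographic perturbation analysis for resume
# summarization. All identifiers are hypothetical stand-ins, not the
# paper's actual dataset construction or bias test.
from difflib import SequenceMatcher

# Hypothetical name swaps that change the demographic signal while
# leaving every qualification in the resume untouched.
RACE_SWAPS = {
    "Emily Walsh": "Lakisha Washington",
    "Greg Baker": "Jamal Robinson",
}

def perturb(resume: str, swaps: dict[str, str]) -> str:
    """Return a counterfactual resume with only names replaced."""
    for original, counterfactual in swaps.items():
        resume = resume.replace(original, counterfactual)
    return resume

def summary_divergence(summarize, resume: str, swaps: dict[str, str]) -> float:
    """Score how much the model's summary changes under a demographic
    perturbation; 0.0 means the two summaries are identical."""
    base = summarize(resume)
    counterfactual = summarize(perturb(resume, swaps))
    return 1.0 - SequenceMatcher(None, base, counterfactual).ratio()
```

A resume whose divergence exceeds a threshold calibrated on innocuous, non-demographic edits would count toward the roughly 10% of summaries flagged for race-based differences.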

📝 Abstract
Large language models (LLMs) are increasingly being deployed in high-stakes applications like hiring, yet their potential for unfair decision-making and outcomes remains understudied, particularly in generative settings. In this work, we examine the fairness of LLM-based hiring systems through two real-world tasks: resume summarization and retrieval. By constructing a synthetic resume dataset and curating job postings, we investigate whether model behavior differs across demographic groups and is sensitive to demographic perturbations. Our findings reveal that race-based differences appear in approximately 10% of generated summaries, while gender-based differences occur in only 1%. In the retrieval setting, all evaluated models display non-uniform selection patterns across demographic groups and exhibit high sensitivity to both gender and race-based perturbations. Surprisingly, retrieval models demonstrate comparable sensitivity to non-demographic changes, suggesting that fairness issues may stem, in part, from general brittleness issues. Overall, our results indicate that LLM-based hiring systems, especially at the retrieval stage, can exhibit notable biases that lead to discriminatory outcomes in real-world contexts.
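The two retrieval-side effects the abstract reports, non-uniform selection across groups and sensitivity to perturbations, could be measured along the lines of the sketch below. The `retrieve` callable (assumed to return candidate indices ranked by relevance) and both helper functions are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the two retrieval-side checks: (1) whether
# top-k slots are spread uniformly across demographic groups, and
# (2) how much the top-k set shifts when every resume is perturbed.
from collections import Counter

def selection_rates(retrieve, job_posting, resumes, groups, k=10):
    """Fraction of top-k slots per demographic group; groups[i] is the
    group label of resumes[i]. A fair retriever's rates would track
    each group's share of the candidate pool."""
    top_k = retrieve(job_posting, resumes)[:k]
    counts = Counter(groups[i] for i in top_k)
    return {g: counts.get(g, 0) / k for g in set(groups)}

def rank_sensitivity(retrieve, job_posting, resumes, perturb_fn, k=10):
    """Jaccard distance between the top-k index sets before and after
    applying perturb_fn to every resume. Running this with both a
    demographic swap and an innocuous edit (e.g. reordering bullet
    points) gives the comparison the abstract describes."""
    before = set(retrieve(job_posting, resumes)[:k])
    after = set(retrieve(job_posting, [perturb_fn(r) for r in resumes])[:k])
    return 1.0 - len(before & after) / len(before | after)
```

If `rank_sensitivity` comes out similarly high for both kinds of perturbation, as the paper reports, the instability cannot be attributed to demographic content alone.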
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Hiring Bias
Text Generation Fairness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Fairness in Recruitment
Bias Sensitivity Analysis