AI Summary
This study demonstrates that large language model (LLM)-driven resume screening can perpetuate demographic bias even after explicit personal identifiers are removed, because subtle sociocultural cues remain embedded in language use and stated interests. To evaluate fairness systematically, the authors develop a generalizable stress-testing framework that generates 4,100 anonymized resume variants differing only along sociocultural dimensions, contextualized within Singapore's multiracial setting. Evaluating 18 LLMs across two real-world hiring scenarios, the work reveals for the first time that implicit markers in ostensibly anonymous resumes are sufficient to induce significant group-based discrimination. Notably, interpretability prompts intended to enhance transparency unexpectedly exacerbate bias, challenging prevailing assumptions underlying anonymization and explainable AI practices. Experiments further show that models can infer ethnicity from linguistic patterns and gender from stated interests with high accuracy, and that they consistently favor Chinese and White male candidates.
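As a rough illustration of the variant-generation idea only (not the authors' released code), the sketch below builds marker-only variants of a single anonymized resume; the marker strings, group labels, and function names are placeholders introduced here for illustration.

```python
# Minimal sketch, assuming hypothetical marker strings, of how controlled
# resume variants might be built: each base resume is paired with
# job-irrelevant sociocultural markers so that variants differ only along
# ethnicity and gender. Illustrative only; not the study's generation code.
from itertools import product

ETHNICITY_MARKERS = {  # placeholder language markers per group
    "Chinese": "Languages: English, Mandarin Chinese",
    "Malay": "Languages: English, Malay",
    "Indian": "Languages: English, Tamil",
    "Caucasian": "Languages: English",
}
GENDER_MARKERS = {  # placeholder hobby/activity markers per gender
    "male": "Co-curricular: basketball team captain",
    "female": "Co-curricular: netball team captain",
}

def make_variants(base_resume: str) -> dict:
    """Return {(ethnicity, gender): resume_text} variants of one base resume."""
    return {
        (eth, gen): f"{base_resume}\n{eth_marker}\n{gen_marker}"
        for (eth, eth_marker), (gen, gen_marker) in product(
            ETHNICITY_MARKERS.items(), GENDER_MARKERS.items()
        )
    }

if __name__ == "__main__":
    demo = make_variants("Software Engineer, 3 years of backend experience.")
    print(len(demo))  # 4 ethnicities x 2 genders = 8 marker-only variants
```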
Abstract
Large Language Models (LLMs) are increasingly deployed in resume screening pipelines. Although explicit PII (e.g., names) is commonly redacted, resumes typically retain subtle sociocultural markers (languages, co-curricular activities, volunteering, hobbies) that can act as demographic proxies. We introduce a generalisable stress-test framework for hiring fairness, instantiated in the Singapore context: 100 neutral, job-aligned resumes are augmented into 4,100 variants spanning four ethnicities and two genders, differing only in job-irrelevant markers. We evaluate 18 LLMs in two realistic settings: (i) Direct Comparison (1v1) and (ii) Score & Shortlist (top-scoring rate), each with and without rationale prompting. Even without explicit identifiers, models recover demographic attributes with high F1 scores and exhibit systematic disparities, favouring markers associated with Chinese and Caucasian males. Ablations show that language markers alone suffice for ethnicity inference, whereas gender inference relies on hobbies and activities. Furthermore, prompting for explanations tends to amplify bias. Our findings suggest that seemingly innocuous markers surviving anonymisation can materially skew automated hiring outcomes.
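To make the two evaluation settings concrete, here is a minimal Python sketch under stated assumptions: `query_llm` is a hypothetical stand-in for whatever model API is queried, and the prompt wording is illustrative. It is not the paper's implementation. `direct_comparison` mirrors the 1v1 setting, while `top_scoring_rate` computes the Score & Shortlist metric as the fraction of rounds in which each group's variant receives the top score.

```python
# Minimal sketch of the two evaluation settings, assuming a hypothetical
# `query_llm(prompt) -> str` callable in place of any real model API.
# This illustrates the protocol only; it is not the paper's implementation.
from collections import Counter
from typing import Callable, Dict, List

def direct_comparison(resume_a: str, resume_b: str,
                      query_llm: Callable[[str], str]) -> str:
    """1v1 setting: ask the model which of two marker-only variants to shortlist."""
    prompt = (
        "You are screening two candidates for the same role.\n\n"
        f"Candidate A:\n{resume_a}\n\nCandidate B:\n{resume_b}\n\n"
        "Reply with exactly 'A' or 'B'."
    )
    return query_llm(prompt).strip().upper()[:1]

def top_scoring_rate(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Score & Shortlist setting: fraction of rounds in which each group's
    variant receives the top score (ties credit every tied group)."""
    groups = list(scores)
    n_rounds = len(scores[groups[0]])
    wins = Counter()
    for round_scores in zip(*(scores[g] for g in groups)):
        best = max(round_scores)
        for group, score in zip(groups, round_scores):
            if score == best:
                wins[group] += 1
    return {g: wins[g] / n_rounds for g in groups}

# Example: identical resumes scored over three rounds, differing only in markers.
# top_scoring_rate({"Chinese": [8, 7, 9], "Malay": [7, 7, 8]})
# -> {"Chinese": 1.0, "Malay": 0.33...}; a persistent gap signals group disparity.
```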