🤖 AI Summary
This work addresses the uneven cross-lingual performance and fairness challenges that large language models face in multilingual human resources scenarios, where targeted machine reading comprehension benchmarks are lacking. We introduce a multilingual reading comprehension benchmark comprising five languages, 105 synthetically generated resume–job description pairs, and 581 question–answer pairs, with questions categorized into three complexity levels: factual extraction, single-document reasoning, and cross-document reasoning. Data authenticity and privacy are preserved through de-identification, template-based synthesis, and embedded placeholders, enabling systematic bias analysis. A high-quality multi-way parallel corpus is ensured via the TEaR human-in-the-loop translation pipeline and MQM error annotation. Baseline evaluations reveal strong model performance in English and Spanish but significant degradation in other languages, highlighting critical capability gaps. The dataset is publicly released.
📝 Abstract
We introduce JobResQA, a multilingual Question Answering benchmark for evaluating the Machine Reading Comprehension (MRC) capabilities of LLMs on HR-specific tasks involving résumés and job descriptions. The dataset comprises 581 QA pairs across 105 synthetic résumé–job description pairs in five languages (English, Spanish, Italian, German, and Chinese), with questions spanning three complexity levels from basic factual extraction to complex cross-document reasoning. We propose a data generation pipeline derived from real-world sources through de-identification and data synthesis to ensure both realism and privacy, while controlled demographic and professional attributes (implemented via placeholders) enable systematic bias and fairness studies. We also present a cost-effective, human-in-the-loop translation pipeline based on the TEaR methodology, incorporating MQM error annotations and selective post-editing to ensure a high-quality multi-way parallel benchmark. We provide baseline evaluations across multiple open-weight LLM families using an LLM-as-judge approach, revealing strong performance on English and Spanish but substantial degradation on other languages, highlighting critical gaps in multilingual MRC capabilities for HR applications. JobResQA provides a reproducible benchmark for advancing fair and reliable LLM-based HR systems. The benchmark is publicly available at: https://github.com/Avature/jobresqa-benchmark