Towards Smarter Hiring: Are Zero-Shot and Few-Shot Pre-trained LLMs Ready for HR Spoken Interview Transcript Analysis?

📅 2025-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates large language models (LLMs) for analyzing HR interview transcripts, focusing on three core tasks: scoring consistency, error detection, and generation of actionable feedback. To enable rigorous assessment, we introduce HURIT—the first domain-specific benchmark dataset for HR interviews, comprising 3,890 real-world transcripts. We conduct the first zero-shot and few-shot multidimensional evaluation of GPT-4 Turbo, GPT-3.5 Turbo, and Llama-2 on this benchmark and propose a human-AI collaborative evaluation framework. Results show that GPT-4 Turbo achieves high inter-rater agreement with human experts (Spearman’s ρ ≈ 0.85) in scoring, yet attains only 52.3% accuracy in error identification and generates actionable suggestions of substantially lower quality than human-written ones. GPT-3.5 Turbo performs moderately, while Llama-2 lags significantly. Overall, current LLMs lack sufficient reliability for fully automated HR interview assessment, underscoring the necessity of domain-adapted prompt engineering and closed-loop human feedback integration.
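The reported scoring agreement (Spearman's ρ ≈ 0.85 between GPT-4 Turbo and human experts) is a rank correlation: the Pearson correlation of tie-adjusted ranks. As a minimal illustration of how such agreement could be computed, the sketch below uses made-up scores, not HURIT data:

```python
# Minimal sketch of Spearman's rho for LLM-vs-human scoring agreement.
# The score lists are illustrative placeholders, not from the HURIT dataset.

def rank(values):
    """Average (1-based) ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend over the tie group
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

human = [4, 3, 5, 2, 4, 1, 3, 5]   # illustrative expert scores per transcript
llm   = [4, 3, 4, 2, 5, 2, 3, 5]   # illustrative GPT-4 Turbo scores
print(round(spearman_rho(human, llm), 3))  # → 0.894
```

In practice one would use `scipy.stats.spearmanr` on per-transcript score pairs; the point is that agreement is measured on rank order, so an LLM can track expert rankings closely while still mis-identifying specific errors.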

📝 Abstract
This research paper presents a comprehensive analysis of the performance of prominent pre-trained large language models (LLMs), including GPT-4 Turbo, GPT-3.5 Turbo, text-davinci-003, text-babbage-001, text-curie-001, text-ada-001, llama-2-7b-chat, llama-2-13b-chat, and llama-2-70b-chat, in comparison to expert human evaluators at providing scores, identifying errors, and offering feedback and improvement suggestions to candidates in mock HR (Human Resources) interviews. We introduce a dataset called HURIT (Human Resource Interview Transcripts), which comprises 3,890 HR interview transcripts sourced from real-world HR interview scenarios. Our findings reveal that pre-trained LLMs, particularly GPT-4 Turbo and GPT-3.5 Turbo, exhibit commendable performance and are capable of producing evaluations comparable to those of expert human evaluators. However, although these LLMs produce scores on par with human experts under human evaluation metrics, they frequently fail to identify errors and to offer specific, actionable advice for improving candidate performance in HR interviews. Our research suggests that current state-of-the-art pre-trained LLMs are not yet suitable for fully automated deployment in HR interview assessment. Instead, our findings advocate a human-in-the-loop approach that incorporates manual checks for inconsistencies and mechanisms for improving feedback quality as a more suitable strategy.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' performance in HR interview transcript analysis
Comparing LLMs and human experts in scoring and feedback quality
Assessing LLMs' limitations in error detection and actionable advice
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes pre-trained LLMs for HR transcript analysis
Introduces HURIT dataset with real-world HR interviews
Advocates human-in-the-loop for better feedback quality
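The human-in-the-loop strategy the paper advocates can be sketched as a simple triage loop: accept an LLM score only when the model is self-consistent, and route everything else to a human reviewer. Everything below (the `llm_score` stub, the two-pass consistency check, the tolerance) is an illustrative assumption, not the paper's actual pipeline:

```python
# Hedged sketch of a human-in-the-loop triage for LLM interview scoring.
# `llm_score` is a deterministic stand-in for a real model call (e.g. to
# GPT-4 Turbo); thresholds and scores here are illustrative only.

def llm_score(transcript: str, seed: int) -> float:
    """Stub standing in for an LLM scoring call (fake 0-5 score)."""
    return (len(transcript) * (seed + 3)) % 50 / 10.0

def triage(transcripts, tolerance=0.5):
    """Split transcripts into auto-accepted scores and a human-review queue."""
    accepted, review_queue = {}, []
    for t in transcripts:
        s1, s2 = llm_score(t, seed=1), llm_score(t, seed=2)
        if abs(s1 - s2) <= tolerance:   # self-consistent -> accept the mean
            accepted[t] = (s1 + s2) / 2
        else:                           # inconsistent -> send to a human
            review_queue.append(t)
    return accepted, review_queue

accepted, queue = triage(["Tell me about yourself ...",
                          "Describe a conflict you resolved ..."])
print(len(accepted), len(queue))
```

The design choice mirrors the paper's finding: scoring is reliable enough to automate in the consistent cases, while error identification and feedback quality still require a manual check.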