LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation

📅 2024-12-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Traditional LLM evaluation suffers from low reproducibility, labor-intensive procedures, and inconsistent results. To address these issues, this paper proposes a dynamic, interview-style evaluation paradigm in which an LLM acts as an active interviewer: it generates questions adaptively across multiple rounds of interaction, incorporates real-time feedback and iterative probing, and mitigates data contamination. The novel "LLM-as-an-Interviewer" framework is paired with structured Interview Reports for granular strength/weakness analysis. It combines multi-turn dialogue modeling, feedback-driven follow-up question generation, task-specific adaptation to benchmarks (e.g., MATH, DepthQA), and dynamic dataset reconstruction. Evaluated on six mainstream LLMs, the method significantly improves assessment consistency; precisely characterizes response quality, feedback comprehension, and knowledge-clarification ability; and enhances both the practical applicability and analytical depth of LLM evaluation.
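The multi-turn loop the summary describes — interviewer asks, interviewee answers, interviewer gives feedback and a follow-up — can be sketched as below. This is a minimal illustration, not the authors' released code; `query_llm` is a hypothetical stand-in for any chat-completion client.

```python
def query_llm(role: str, messages: list[dict]) -> str:
    # Placeholder for a real chat-completion call routed to either the
    # interviewer model or the evaluated (interviewee) model.
    return f"[{role} reply to: {messages[-1]['content'][:40]}]"

def run_interview(seed_question: str, max_rounds: int = 3) -> list[dict]:
    """Run the interview loop: the interviewer poses a seed question, then
    for each round the interviewee answers and the interviewer responds
    with feedback plus a follow-up question probing that answer."""
    transcript = [{"role": "interviewer", "content": seed_question}]
    for _ in range(max_rounds):
        answer = query_llm("interviewee", transcript)
        transcript.append({"role": "interviewee", "content": answer})
        # Feedback-driven follow-up: conditioned on the full transcript,
        # so the next question adapts to the previous answer.
        follow_up = query_llm("interviewer", transcript)
        transcript.append({"role": "interviewer", "content": follow_up})
    return transcript

transcript = run_interview("Solve: what is the remainder of 2**10 mod 7?")
```

In a real deployment the interviewer prompt would also carry the task-specific rubric (e.g., MATH grading criteria) and the dynamically modified seed dataset mentioned in the summary.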

๐Ÿ“ Abstract
We introduce LLM-as-an-Interviewer, a novel paradigm for evaluating large language models (LLMs). This approach leverages multi-turn interactions where the LLM interviewer actively provides feedback on responses and poses follow-up questions to the evaluated LLM. At the start of the interview, the LLM interviewer dynamically modifies datasets to generate initial questions, mitigating data contamination. We apply the LLM-as-an-Interviewer framework to evaluate six models on the MATH and DepthQA tasks. Our results show that the framework effectively provides insights into LLM performance, including the quality of initial responses, adaptability to feedback, and ability to address follow-up queries like clarification or additional knowledge requests. The framework also addresses key limitations of conventional methods like LLM-as-a-Judge, including verbosity bias and inconsistency across runs. Finally, we propose the Interview Report, which aggregates insights from the interview process, providing examples and a comprehensive analysis of the LLM's strengths and weaknesses. This report offers a detailed snapshot of the model's real-world applicability. The code for our framework is publicly available at https://github.com/interview-eval/.
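The Interview Report the abstract proposes aggregates per-question signals (initial response quality, adaptability to feedback, follow-up handling) into a strengths/weaknesses summary. A minimal sketch of such an aggregation follows; the record schema and field names are assumptions for illustration, not the paper's actual report format.

```python
from collections import Counter

def build_interview_report(records: list[dict]) -> dict:
    """Aggregate per-question interview outcomes into a simple report.
    Each record is assumed to look like:
      {"topic": str, "initial_correct": bool, "fixed_after_feedback": bool,
       "followups_resolved": int, "followups_total": int}
    """
    n = len(records)
    initial = sum(r["initial_correct"] for r in records)
    # Of the questions missed initially, how many were fixed after feedback?
    recovered = sum(
        r["fixed_after_feedback"] for r in records if not r["initial_correct"]
    )
    followup_rate = (
        sum(r["followups_resolved"] for r in records)
        / max(1, sum(r["followups_total"] for r in records))
    )
    # Topics the model never got right, even with feedback.
    weak_topics = Counter(
        r["topic"] for r in records
        if not (r["initial_correct"] or r["fixed_after_feedback"])
    )
    return {
        "initial_accuracy": initial / n,
        "feedback_recovery": recovered / max(1, n - initial),
        "followup_resolution": followup_rate,
        "weakest_topics": [t for t, _ in weak_topics.most_common(3)],
    }

records = [
    {"topic": "algebra", "initial_correct": True, "fixed_after_feedback": False,
     "followups_resolved": 2, "followups_total": 2},
    {"topic": "geometry", "initial_correct": False, "fixed_after_feedback": True,
     "followups_resolved": 1, "followups_total": 2},
    {"topic": "geometry", "initial_correct": False, "fixed_after_feedback": False,
     "followups_resolved": 0, "followups_total": 1},
]
report = build_interview_report(records)
```

Separating "initial accuracy" from "feedback recovery" is the point of the interview format: two models with equal one-shot accuracy can differ sharply in how well they use feedback.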
Problem

Research questions and friction points this paper is trying to address.

Dynamic Assessment
Large Language Models
Conversation Performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Assessment
Large Language Models
Interview Simulation