ReqElicitGym: An Evaluation Environment for Interview Competence in Conversational Requirements Elicitation

📅 2026-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of systematic, quantifiable, and reproducible evaluation methods for assessing the interviewing capabilities of large language models (LLMs) in conversational requirements elicitation. To this end, it introduces the first standardized automated evaluation framework tailored to requirements-gathering interviews, comprising a dataset of 101 scenarios, high-fidelity LLM-driven simulated users, and an automatic task evaluator that achieves strong alignment with real-user interactions and expert judgments across multiple dialogue quality dimensions. Empirical results reveal that current state-of-the-art models uncover fewer than half of implicit requirements, exhibit particularly poor performance on stylistic requirements, and tend to generate effective questions predominantly in the latter stages of dialogues.

📝 Abstract
With the rapid improvement of LLMs' coding capabilities, the bottleneck of LLM-based automated software development is shifting from generating correct code to eliciting users' requirements. Despite growing interest, the interview competence of LLMs in conversational requirements elicitation remains largely underexplored. Existing evaluations often depend on a small number of scenarios, real-user interaction, and subjective human scoring, which hinders systematic and quantitative comparison. To address these challenges, we propose ReqElicitGym, an interactive, automatic evaluation environment for assessing interview competence in conversational requirements elicitation. Specifically, ReqElicitGym introduces a new evaluation dataset and designs both an interactive oracle user and a task evaluator. The dataset contains 101 website requirements elicitation scenarios spanning 10 application types. Both the oracle user and the task evaluator achieve high agreement with real users and expert judgment. Using ReqElicitGym, any automated conversational requirements elicitation approach (e.g., an LLM-based agent) can be evaluated in a reproducible and quantitative manner through interaction with the environment. Based on ReqElicitGym, we conduct a systematic empirical study of seven representative LLMs; the results show that current LLMs still exhibit limited interview competence in uncovering implicit requirements. In particular, they elicit fewer than half of users' implicit requirements, and their effective elicitation questions often emerge only in later turns of the dialogue. In addition, we find that LLMs can elicit interaction- and content-related implicit requirements but consistently struggle with style-related requirements. We believe ReqElicitGym will facilitate the evaluation and development of automated conversational requirements elicitation.
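The abstract describes an interaction-based evaluation loop: an interviewer agent questions an LLM-driven oracle user, and a task evaluator scores how many of the scenario's implicit requirements were uncovered. The sketch below is a minimal illustration of that loop under stated assumptions; all names (Scenario, OracleUser, TaskEvaluator, run_episode) are hypothetical and do not reflect the paper's actual API, and simple keyword/string matching stands in for the paper's LLM-driven simulation and expert-aligned judging.

```python
# Minimal sketch of an interaction-based evaluation loop in the spirit of
# ReqElicitGym. All names here (Scenario, OracleUser, TaskEvaluator,
# run_episode) are hypothetical illustrations, not the paper's actual API.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Scenario:
    """One requirements-elicitation scenario (the paper's dataset has 101)."""
    description: str                   # initial request shown to the interviewer
    implicit_requirements: list[str]   # hidden requirements the oracle user holds

@dataclass
class DialogueTurn:
    question: str
    answer: str

class OracleUser:
    """Simulated user that reveals an implicit requirement only when asked about it.
    In the paper this role is played by an LLM; keyword matching stands in here."""
    def __init__(self, scenario: Scenario):
        self.scenario = scenario

    def respond(self, question: str) -> str:
        q = question.lower()
        for req in self.scenario.implicit_requirements:
            if any(word in q for word in req.lower().split()):
                return f"Yes, actually I also need: {req}"
        return "Nothing specific comes to mind for that."

class TaskEvaluator:
    """Scores the fraction of implicit requirements surfaced in the dialogue.
    A real evaluator would use expert-aligned judging; string matching stands in."""
    def recall(self, scenario: Scenario, dialogue: list[DialogueTurn]) -> float:
        transcript = " ".join(f"{t.question} {t.answer}" for t in dialogue).lower()
        covered = sum(req.lower() in transcript
                      for req in scenario.implicit_requirements)
        return covered / max(len(scenario.implicit_requirements), 1)

Interviewer = Callable[[str, list[DialogueTurn]], Optional[str]]

def run_episode(scenario: Scenario, interviewer: Interviewer,
                max_turns: int = 10) -> float:
    """Let the interviewer question the oracle user, then score the dialogue."""
    user = OracleUser(scenario)
    dialogue: list[DialogueTurn] = []
    for _ in range(max_turns):
        question = interviewer(scenario.description, dialogue)
        if question is None:           # interviewer decides the interview is done
            break
        dialogue.append(DialogueTurn(question, user.respond(question)))
    return TaskEvaluator().recall(scenario, dialogue)

if __name__ == "__main__":
    scenario = Scenario(
        description="I want a website for my bakery.",
        implicit_requirements=["online ordering", "mobile friendly layout"],
    )
    # A trivial interviewer that asks two fixed questions, then stops.
    questions = iter(["Do you need online ordering?",
                      "Should the site be mobile friendly?"])
    score = run_episode(scenario, lambda desc, hist: next(questions, None))
    print(f"Implicit-requirement recall: {score:.2f}")
```

The key design point this sketch mirrors is that the interviewer only ever sees the initial description and the dialogue so far, so implicit requirements must be surfaced through questioning rather than read from the scenario.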
Problem

Research questions and friction points this paper is trying to address.

requirements elicitation
interview competence
conversational AI
LLM evaluation
implicit requirements
Innovation

Methods, ideas, or system contributions that make the work stand out.

conversational requirements elicitation
LLM evaluation
interactive oracle user
implicit requirements
automated software development