🤖 AI Summary
Prior work has not systematically explored large language models' (LLMs) potential as active interviewers in autobiographical interviews; in particular, it lacks formal modeling and evaluation of goal-directedness, contextual coherence, and empathetic interaction. Method: We propose GuideLLM, the first LLM-based guided dialogue framework tailored for autobiographical interviews, formally defining three core dimensions of guided dialogue: goal navigation, context management, and empathetic interaction. We construct a multi-dimensional automated evaluation environment and a user-agent benchmark grounded in real autobiographical data. Contribution/Results: Through comparative experiments across multiple models, LLM-as-a-judge evaluation, and a 45-participant human study, GuideLLM significantly outperforms six state-of-the-art baselines, including GPT-4o and Llama-3-70b-Instruct, in both interview quality and autobiographical narrative generation, establishing a new paradigm for LLM-driven guided dialogue.
📝 Abstract
Although Large Language Models (LLMs) succeed in human-guided conversations such as instruction following and question answering, the potential of LLM-guided conversations, where LLMs direct the discourse and steer the conversation's objectives, remains under-explored. In this study, we first characterize LLM-guided conversation by three fundamental components: (i) Goal Navigation; (ii) Context Management; (iii) Empathetic Engagement, and propose GuideLLM as an instantiation. We then implement an interviewing environment for the evaluation of LLM-guided conversation. This environment covers diverse topics for comprehensive interviewing evaluation, yielding around 1.4k turns of utterances, 184k tokens, and over 200 events mentioned during interviewing for each chatbot evaluation. We compare GuideLLM with six state-of-the-art LLMs, including GPT-4o and Llama-3-70b-Instruct, in terms of both interviewing quality and autobiography generation quality. For automatic evaluation, we derive user proxies from multiple autobiographies and employ LLM-as-a-judge to score LLM behaviors. We further conduct a human study with 45 participants who chat with GuideLLM and the baselines, collecting their feedback, preferences, and ratings of conversation and autobiography quality. Experimental results indicate that GuideLLM significantly outperforms baseline LLMs in automatic evaluation and consistently leads in human ratings.