ClarifyMT-Bench: Benchmarking and Improving Multi-Turn Clarification for Conversational Large Language Models

📅 2025-12-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) frequently answer prematurely in open-domain multi-turn dialogues when user inputs are ambiguous, yet prevailing clarification benchmarks are limited to single-turn or idealized cooperative settings and therefore fail to assess genuine clarification capability in realistic multi-turn contexts. Method: We introduce ClarifyMT-Bench, the first multi-turn dialogue clarification benchmark, built on a five-dimensional ambiguity taxonomy and six role-driven user personas and yielding 6,120 high-fidelity dialogues; annotation combines hybrid LLM-human verification with modular strategy modeling. We also propose ClarifyAgent, a four-stage cognitive architecture (Perceive–Predict–Track–Plan) that explicitly decomposes clarification reasoning into separate stages. Contribution/Results: Evaluation across ten mainstream LLMs reveals pervasive clarification deficits. ClarifyAgent improves multi-turn clarification accuracy by 32.7% on average and significantly enhances robustness in deep, context-sensitive dialogues.

📝 Abstract
Large language models (LLMs) are increasingly deployed as conversational assistants in open-domain, multi-turn settings, where users often provide incomplete or ambiguous information. However, existing LLM-focused clarification benchmarks primarily assume single-turn interactions or cooperative users, limiting their ability to evaluate clarification behavior in realistic settings. We introduce ClarifyMT-Bench, a benchmark for multi-turn clarification grounded in a five-dimensional ambiguity taxonomy and a set of six behaviorally diverse simulated user personas. Through a hybrid LLM-human pipeline, we construct 6,120 multi-turn dialogues capturing diverse ambiguity sources and interaction patterns. Evaluating ten representative LLMs uncovers a consistent under-clarification bias: LLMs tend to answer prematurely, and performance degrades as dialogue depth increases. To mitigate this, we propose ClarifyAgent, an agentic approach that decomposes clarification into perception, forecasting, tracking, and planning, substantially improving robustness across ambiguity conditions. ClarifyMT-Bench establishes a reproducible foundation for studying when LLMs should ask, when they should answer, and how to navigate ambiguity in real-world human-LLM interactions.
Problem

Research questions and friction points this paper is trying to address.

Benchmarks multi-turn clarification in conversational LLMs
Addresses under-clarification bias in ambiguous user interactions
Improves robustness via agentic clarification decomposition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn clarification benchmark with ambiguity taxonomy
Agentic approach decomposing clarification into four stages
Hybrid LLM-human pipeline for diverse dialogue construction
Authors
Sichun Luo, City University of Hong Kong (recommender systems, large language models)
Yi Huang, JIUTIAN Research, China Mobile
Mukai Li, The University of Hong Kong (natural language processing)
Shichang Meng, CityUHK
Fengyuan Liu, The University of Hong Kong
Zefa Hu, JIUTIAN Research, China Mobile
Junlan Feng, Chief Scientist at China Mobile Research (Natural Language, Machine Learning, Speech Processing, Data Mining)
Qi Liu, The University of Hong Kong