Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) lack the ability to proactively identify and respond to implicit errors when no explicit error-handling instructions are given. Method: We introduce Mis-prompt, the first dedicated benchmark for evaluating proactive error handling, comprising four representative task categories, an original taxonomy of implicit errors, and a high-quality dataset supported by both human-annotated and automated evaluation. We formally define and quantitatively assess LLMs' proactive error-handling ability, propose a multi-dimensional evaluation framework, and empirically validate improvement via supervised fine-tuning (SFT). Contribution/Results: Experimental results reveal that state-of-the-art LLMs perform consistently poorly on implicit error handling, and that SFT significantly improves both detection accuracy and response reasonableness. All data, annotations, evaluation protocols, and code will be publicly released to foster reproducible research.
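
To make the benchmark setup concrete, here is a minimal Python sketch of what a single evaluation instance with its ground-truth annotations might look like. The class name, field names, and error-category label are illustrative assumptions for this page, not the released Mis-prompt schema.

```python
# Hypothetical shape of one proactive-error-handling instance.
# All names below are assumptions, not the paper's released schema.
from dataclasses import dataclass, field

@dataclass
class MisPromptInstance:
    prompt: str                # user request containing an implicit error
    error_category: str        # label from an implicit-error taxonomy (assumed)
    reference_detection: bool  # ground truth: does the prompt contain an error?
    reference_response: str    # annotated ideal reply that flags and handles it
    tags: list[str] = field(default_factory=list)

example = MisPromptInstance(
    prompt="Book me a train from Shanghai to Shanghai tomorrow at 25:30.",
    error_category="contradictory_constraint",  # assumed taxonomy label
    reference_detection=True,
    reference_response=(
        "The departure and destination are the same city, and 25:30 is not a "
        "valid time. Could you confirm where you are going and when?"
    ),
)
```

The key property is that the prompt itself never tells the model to look for mistakes; any error handling must be proactive.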

📝 Abstract
Large language models (LLMs) have demonstrated significant advancements in error handling. Current error-handling work is performed passively, relying on explicit error-handling instructions. However, in real-world scenarios, explicit error-handling instructions are usually unavailable. This paper identifies the resulting challenge: how to conduct proactive error handling without explicit error-handling instructions. To promote further research, this work introduces a new benchmark, termed Mis-prompt, consisting of four evaluation tasks, an error-category taxonomy, and a new evaluation dataset. Furthermore, this work analyzes current LLMs' performance on the benchmark. The experimental results reveal that current LLMs perform poorly on proactive error handling, and that SFT on error-handling instances improves LLMs' proactive error-handling capabilities. The dataset will be publicly available.
Problem

Research questions and friction points this paper is trying to address.

Proactive error handling without explicit instructions
Benchmarking LLMs for real-world error scenarios
Improving LLM performance via error-handling SFT
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Mis-prompt benchmark for error handling
Evaluates LLMs without explicit error-handling instructions (see the sketch below)
Shows SFT improves proactive error handling
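
As a rough illustration of this evaluation setting, the sketch below sends each prompt to a model verbatim, with no added error-handling instruction, and scores whether the reply proactively flags the implicit error. The keyword judge is an invented stand-in; the paper's actual protocol combines human-annotated and automated evaluation. It reuses the hypothetical MisPromptInstance from the sketch above.

```python
# Minimal sketch of scoring proactive error detection; assumes the
# hypothetical MisPromptInstance defined above. The cue list is invented.
from typing import Callable, Iterable

def flags_error(reply: str) -> bool:
    """Naive judge: does the reply question or correct the request?"""
    cues = ("error", "invalid", "contradict", "did you mean", "not possible")
    return any(cue in reply.lower() for cue in cues)

def proactive_detection_rate(model: Callable[[str], str],
                             instances: Iterable[MisPromptInstance]) -> float:
    """Fraction of instances where the model's verdict matches ground truth.
    Prompts are sent as-is: no error-handling instruction is appended."""
    instances = list(instances)
    hits = sum(flags_error(model(inst.prompt)) == inst.reference_detection
               for inst in instances)
    return hits / len(instances)

# e.g. a model that answers literally scores 0 on the example above:
# proactive_detection_rate(lambda p: "Ticket booked for 25:30.", [example])
```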
Authors

Jiayi Zeng · East China Normal University, Shanghai, China · LLM Evaluation, Benchmarking
Yizhe Feng · Beihang University, Beijing, China
Mengliang He · East China Normal University, Shanghai, China
Wenhui Lei · University of Pennsylvania · AI4Health, Artificial Intelligence
Wei Zhang · East China Normal University, Shanghai, China
Zeming Liu · Beihang University, Beijing, China
Xiaoming Shi · East China Normal University, Shanghai, China
Aimin Zhou · East China Normal University, Shanghai, China