ConInstruct: Evaluating Large Language Models on Conflict Detection and Resolution in Instructions

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing instruction-following research overlooks scenarios where user instructions contain conflicting constraints, leaving a critical gap in evaluating LLM robustness under such conditions. Method: We introduce ConInstruct, the first dedicated benchmark for conflict-aware instruction following, comprising manually curated conflicting instructions spanning multiple scenarios, together with an evaluation framework based on F1-score and related metrics for assessing conflict detection and resolution. Contribution/Results: Experiments show that state-of-the-art models (e.g., DeepSeek-R1, Claude-4.5-Sonnet) detect conflicts well (up to 91.5% F1), yet they rarely notify users, request clarification, or negotiate when constraints clash, exposing a fundamental robustness gap. This work identifies the risk of "latent failure" in instruction following, where models appear to comply while silently violating constraints; it establishes a novel evaluation paradigm and informs design principles for trustworthy, interactive LLMs that can handle ambiguous or contradictory user intent.

📝 Abstract
Instruction-following is a critical capability of Large Language Models (LLMs). While existing works primarily focus on assessing how well LLMs adhere to user instructions, they often overlook scenarios where instructions contain conflicting constraints, a common occurrence in complex prompts. The behavior of LLMs under such conditions remains under-explored. To bridge this gap, we introduce ConInstruct, a benchmark specifically designed to assess LLMs' ability to detect and resolve conflicts within user instructions. Using this dataset, we evaluate LLMs' conflict detection performance and analyze their conflict resolution behavior. Our experiments reveal two key findings: (1) Most proprietary LLMs exhibit strong conflict detection capabilities, whereas among open-source models, only DeepSeek-R1 demonstrates similarly strong performance. DeepSeek-R1 and Claude-4.5-Sonnet achieve the highest average F1-scores at 91.5% and 87.3%, respectively, ranking first and second overall. (2) Despite their strong conflict detection abilities, LLMs rarely explicitly notify users about the conflicts or request clarification when faced with conflicting constraints. These results underscore a critical shortcoming in current LLMs and highlight an important area for future improvement when designing instruction-following LLMs.
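The abstract reports conflict detection performance as an average F1-score. A minimal sketch of how such a metric could be computed is below; the data format (per-instruction sets of gold and predicted conflict labels) and the macro-averaging are assumptions for illustration, not the paper's actual evaluation code.

```python
# Illustrative sketch (not the paper's implementation): scoring conflict
# detection with F1, averaged over instructions. Each example pairs the
# gold set of conflict labels with the model's predicted set.

def f1_score(gold: set, pred: set) -> float:
    """F1 between gold and predicted conflict-label sets for one instruction."""
    if not gold and not pred:
        return 1.0  # no conflicts present, none predicted
    tp = len(gold & pred)  # true positives: conflicts found correctly
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def average_f1(examples) -> float:
    """Macro-average F1 over (gold, pred) pairs."""
    return sum(f1_score(g, p) for g, p in examples) / len(examples)

# Hypothetical conflict labels, e.g. a length limit clashing with a
# requirement for exhaustive detail:
examples = [
    ({"length-vs-detail"}, {"length-vs-detail"}),  # detected correctly
    ({"tone-vs-format"}, set()),                   # conflict missed
]
print(round(average_f1(examples), 2))  # 0.5
```

Macro-averaging weights every instruction equally regardless of how many conflicts it contains; a micro-averaged variant (pooling counts across instructions) is an equally plausible reading of "average F1-score".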
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to detect conflicting constraints in instructions
Assessing LLMs' performance in resolving conflicts within user instructions
Analyzing LLM behavior when instructions contain contradictory requirements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces ConInstruct benchmark for conflict evaluation
Evaluates LLMs' conflict detection and resolution capabilities
Analyzes model behavior under conflicting instruction constraints
Xingwei He
The University of Hong Kong
Qianru Zhang
The University of Hong Kong
Pengfei Chen
Xidian University
Guanhua Chen
Southern University of Science and Technology
Linlin Yu
University of Texas at Dallas
Uncertainty Estimation, Trustworthy AI, Graph Neural Networks, NLP
Yuan Yuan
Beihang University; Qingdao Research Institute, Beihang University; Hangzhou Innovation Institute, Beihang University
Siu-Ming Yiu
Professor of Computer Science, The University of Hong Kong
Cybersecurity, Cryptography, FinTech, Bioinformatics