🤖 AI Summary
Current large language models (LLMs) struggle to reliably follow the instruction hierarchy, in which system messages override user inputs, which in turn override conversation history and tool outputs, especially when instructions at different levels conflict; moreover, no comprehensive benchmark exists to evaluate this capability. Method: We introduce IHEval, a benchmark dedicated to instruction-hierarchy adherence, comprising 3,538 examples across nine tasks that cover both aligned and conflicting instructions. We formally define the priority order and, building on human-authored multi-task data, inject controlled instructions at each priority level and trigger conflicts to enable fine-grained attribution of model behavior. Results: All evaluated models show a sharp accuracy drop in conflict scenarios relative to their original instruction-following performance, and the strongest open-source model resolves only 48% of conflicts, exposing a fundamental weakness in hierarchical instruction awareness. IHEval thus provides a needed evaluation tool and a concrete direction for future research.
📝 Abstract
The instruction hierarchy, which establishes a priority order from system messages to user messages, conversation history, and tool outputs, is essential for ensuring consistent and safe behavior in language models (LMs). Despite its importance, this topic has received limited attention, and there is a lack of comprehensive benchmarks for evaluating models' ability to follow the instruction hierarchy. We bridge this gap by introducing IHEval, a novel benchmark comprising 3,538 examples across nine tasks, covering cases where instructions at different priority levels either align or conflict. Our evaluation of popular LMs highlights their struggle to recognize instruction priorities. All evaluated models experience a sharp performance decline when facing conflicting instructions, compared to their original instruction-following performance. Moreover, the most competitive open-source model only achieves 48% accuracy in resolving such conflicts. Our results underscore the need for targeted optimization in the future development of LMs.
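The priority order described above can be illustrated with a toy conflict resolver. This is only a sketch: the four roles and the resolution rule (higher priority wins) follow the abstract, but every name in the code is illustrative and not part of IHEval's actual implementation.

```python
# Toy illustration of the instruction hierarchy: system > user >
# conversation history > tool output. All identifiers here are
# hypothetical, not IHEval's code.
from dataclasses import dataclass

# Lower number = higher priority, per the order stated in the abstract.
PRIORITY = {"system": 0, "user": 1, "history": 2, "tool": 3}

@dataclass
class Message:
    role: str          # one of the keys in PRIORITY
    instruction: str

def expected_instruction(messages):
    """Under a faithful instruction hierarchy, the instruction at the
    highest-priority role should win when instructions conflict."""
    return min(messages, key=lambda m: PRIORITY[m.role]).instruction

# A conflicting scenario: the user's request contradicts the system message.
conflict = [
    Message("system", "Always respond in English."),
    Message("user", "Respond in French."),
]
print(expected_instruction(conflict))  # -> "Always respond in English."
```

A benchmark example in this spirit pairs such a message list with the expected winning instruction, so a model's response can be scored on whether it deferred to the higher-priority level.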