🤖 AI Summary
Current large language models (LLMs) struggle to reliably follow the instruction hierarchy, in which system messages override user inputs, which in turn override conversation history and tool outputs, especially when instructions at different levels conflict; moreover, no comprehensive benchmark exists to evaluate this capability. Method: We introduce IHEval, a benchmark dedicated to instruction-hierarchy adherence, comprising 3,538 examples across nine tasks that cover both aligned and conflicting instructions. We formally define the priority order and, building on human-authored multi-task data, inject controlled instructions at each priority level and trigger conflicts to enable fine-grained attribution of model behavior. Results: All evaluated models show a sharp accuracy drop in conflict scenarios relative to their original instruction-following performance, and the strongest open-source model resolves only 48% of conflicts, exposing a fundamental weakness in hierarchical instruction awareness. IHEval thus provides a needed evaluation tool and a concrete direction for future research.
📝 Abstract
The instruction hierarchy, which establishes a priority order from system messages to user messages, conversation history, and tool outputs, is essential for ensuring consistent and safe behavior in language models (LMs). Despite its importance, this topic has received limited attention, and there is a lack of comprehensive benchmarks for evaluating models' ability to follow the instruction hierarchy. We bridge this gap by introducing IHEval, a novel benchmark comprising 3,538 examples across nine tasks, covering cases where instructions at different priority levels either align or conflict. Our evaluation of popular LMs highlights their struggle to recognize instruction priorities. All evaluated models experience a sharp performance decline when facing conflicting instructions, compared to their original instruction-following performance. Moreover, the most competitive open-source model only achieves 48% accuracy in resolving such conflicts. Our results underscore the need for targeted optimization in the future development of LMs.
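The priority order described above can be illustrated with a toy conflict resolver. This is only a sketch: the four roles and the resolution rule (higher priority wins) follow the abstract, but every name in the code is illustrative and not part of IHEval's actual implementation.

```python
# Toy illustration of the instruction hierarchy: system > user >
# conversation history > tool output. All identifiers here are
# hypothetical, not IHEval's code.
from dataclasses import dataclass

# Lower number = higher priority, per the order stated in the abstract.
PRIORITY = {"system": 0, "user": 1, "history": 2, "tool": 3}

@dataclass
class Message:
    role: str          # one of the keys in PRIORITY
    instruction: str

def expected_instruction(messages):
    """Under a faithful instruction hierarchy, the instruction at the
    highest-priority role should win when instructions conflict."""
    return min(messages, key=lambda m: PRIORITY[m.role]).instruction

# A conflicting scenario: the user's request contradicts the system message.
conflict = [
    Message("system", "Always respond in English."),
    Message("user", "Respond in French."),
]
print(expected_instruction(conflict))  # -> "Always respond in English."
```

A benchmark example in this spirit pairs such a message list with the expected winning instruction, so a model's response can be scored on whether it deferred to the higher-priority level.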