🤖 AI Summary
This work addresses a reliability problem for large language model agents: when instructions from many heterogeneous sources conflict, agents often fail to follow the highest-privilege one. Existing instruction-hierarchy approaches assume a small, fixed set of privilege levels and do not cover the conflicts that arise in real-world agentic settings. To bridge this gap, the work proposes the Many-Tier Instruction Hierarchy (ManyIH) paradigm, which accommodates arbitrarily many privilege levels. It also introduces ManyIH-Bench, the first benchmark for ManyIH, with up to 12 levels of conflicting instructions across 853 agentic tasks (427 coding and 426 instruction-following) spanning 46 real-world agents. Experiments show that even frontier models reach only about 40% accuracy as instruction conflicts scale, underscoring the need for methods that target fine-grained, scalable conflict resolution in agentic settings.
📝 Abstract
Large language model agents receive instructions from many sources (system messages, user prompts, tool outputs, and more), each carrying a different level of trust and authority. When these instructions conflict, models must reliably follow the highest-privilege instruction to remain safe and effective. The dominant paradigm, instruction hierarchy (IH), assumes a fixed, small set of privilege levels (typically fewer than five) defined by rigid role labels (e.g., system > user). This is inadequate for real-world agentic settings, where conflicts can arise across far more sources and contexts. In this work, we propose Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving conflicts among instructions with arbitrarily many privilege levels. We introduce ManyIH-Bench, the first benchmark for ManyIH. ManyIH-Bench requires models to navigate up to 12 levels of conflicting instructions with varying privileges across 853 agentic tasks (427 coding and 426 instruction-following). ManyIH-Bench composes constraints developed by LLMs and verified by humans to create realistic and difficult test cases spanning 46 real-world agents. Our experiments show that even current frontier models perform poorly (~40% accuracy) as instruction conflicts scale. This work underscores the urgent need for methods that explicitly target fine-grained, scalable instruction conflict resolution in agentic settings.
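The rule the abstract describes, following the highest-privilege instruction when several conflict, can be illustrated with a minimal sketch. This is not the benchmark's implementation: the `Instruction` fields, the `resolve` helper, the source labels, and the convention that a larger integer means higher privilege are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    source: str      # hypothetical source label, e.g. "system" or "tool:search"
    privilege: int   # assumed convention: larger number = higher privilege
    constraint: str  # aspect of behavior the instruction governs
    value: str       # what the instruction demands for that aspect


def resolve(instructions):
    """For each constraint, keep the instruction with the highest privilege.

    Ties keep the instruction seen first; a real system would need an
    explicit tie-breaking policy.
    """
    winners = {}
    for ins in instructions:
        current = winners.get(ins.constraint)
        if current is None or ins.privilege > current.privilege:
            winners[ins.constraint] = ins
    return winners


# Five instructions from three sources, conflicting on two constraints.
instructions = [
    Instruction("system", 12, "language", "English"),
    Instruction("user", 7, "language", "French"),
    Instruction("tool:search", 2, "language", "German"),
    Instruction("user", 7, "format", "bullet list"),
    Instruction("tool:search", 2, "format", "table"),
]

resolved = resolve(instructions)
for constraint, ins in resolved.items():
    print(f"{constraint}: {ins.value} (from {ins.source}, privilege {ins.privilege})")
```

Because privilege is an integer rather than a fixed role label, the same function works for any number of levels, which is the property that separates the many-tier setting from a fixed system > user ordering. The difficulty the benchmark measures lies in getting a model to apply this rule implicitly while completing a task, not in computing it explicitly as done here.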