🤖 AI Summary
This work identifies a pervasive failure of the hierarchical control mechanism in large language models (LLMs), wherein system instructions are intended—but frequently fail—to override user instructions. Addressing core deficiencies including inconsistent instruction priority and unreliable separation between system and user prompts, we introduce the first robustness evaluation framework for instruction hierarchy and propose the Constraint Priority Benchmark to systematically assess six state-of-the-art models under instruction conflicts. Empirical results demonstrate that all models exhibit unstable priority adherence: even minor formatting perturbations induce priority inversion. Fine-tuning and prompt engineering yield only marginal improvements. Our findings reveal an inherent constraint preference in current LLM paradigms, fundamentally challenging the foundational assumptions of hierarchical prompting and underscoring the necessity of architectural-level innovation to achieve reliable instruction governance.
📝 Abstract
Large language models (LLMs) are increasingly deployed with hierarchical instruction schemes, where certain instructions (e.g., system-level directives) are expected to take precedence over others (e.g., user messages). Yet, we lack a systematic understanding of how effectively these hierarchical control mechanisms work. We introduce a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies. Our experiments across six state-of-the-art LLMs reveal that models struggle with consistent instruction prioritization, even for simple formatting conflicts. We find that the widely-adopted system/user prompt separation fails to establish a reliable instruction hierarchy, and models exhibit strong inherent biases toward certain constraint types regardless of their priority designation. While controlled prompt engineering and model fine-tuning show modest improvements, our results indicate that instruction hierarchy enforcement is not robustly realized, calling for deeper architectural innovations beyond surface-level modifications.