IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

📅 2026-03-11

📈 Citations: 0

✨ Influential: 0

career value

204K/year

🤖 AI Summary

This work addresses the vulnerability of large language models to threats such as jailbreak attacks, system prompt extraction, and proxy-based prompt injection, which arise from a lack of robust mechanisms for prioritizing conflicting instructions among system directives, developer guidelines, user inputs, and tool commands. To tackle this, the authors introduce IH-Challenge, the first dataset specifically designed for training hierarchical instruction-following capabilities. By integrating online adversarial example generation with reinforcement learning fine-tuning on GPT-5-Mini, their approach effectively distinguishes between genuine instruction-following failures and breakdowns in hierarchical reasoning, thereby avoiding shortcut behaviors like excessive refusal. Evaluated across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks, the method improves average instruction hierarchy robustness by 10.0%, reduces unsafe behavior from 6.6% to 0.7%, and enhances both safety and helpfulness without significantly compromising general capabilities.

Technology Category

Application Category

📝 Abstract

Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolving instruction conflicts. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as overrefusing. We introduce IH-Challenge, a reinforcement learning training dataset, to address these difficulties. Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7% while improving helpfulness on general safety evaluations, and saturates an internal static agentic prompt injection evaluation, with minimal capability regression. We release the IH-Challenge dataset (https://huggingface.co/datasets/openai/ih-challenge) to support future research on robust instruction hierarchy.

Problem

Research questions and friction points this paper is trying to address.

instruction hierarchy

jailbreaks

prompt injection

LLM safety

instruction conflict

Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruction Hierarchy

Adversarial Training

Reinforcement Learning Dataset