OpenAI6 min read

IH-Challenge: How OpenAI Is Teaching AI to Follow the Right Orders

By AI Guide News·Tuesday, March 10, 2026

OpenAI has released IH-Challenge, a training dataset that strengthens how AI models prioritize instructions from different sources — a foundational safety fix that makes models more robust against prompt injection, manipulation, and override attacks.

[AD] Rectangle 300×250 / In-article

The Problem No One Talks About Enough

Every AI system receives instructions from multiple sources simultaneously: safety policies baked into the system, developer guidelines in the system prompt, requests from users, and content pulled from the web. In an ideal world, the model always knows which source to trust most. In practice, it often doesn't — and that gap is where a surprisingly large number of AI safety failures live.

OpenAI has released IH-Challenge, a training dataset specifically designed to fix this. The goal: teach models to reliably prioritize instructions from higher-trust sources when those instructions conflict with lower-trust ones.

The Trust Hierarchy

OpenAI's models are trained to follow a clear chain of command:

System Message — highest trust, set by OpenAI's policies
Developer Message — product-level instructions from the operator
User Message — requests from the end user
Tool Message — output from external tools, web content, or agent calls

When these levels conflict — say, a system message forbids revealing a PIN, but a user message asks for it — the model must recognize the conflict and follow the higher-privilege instruction. This sounds straightforward, but training it reliably at scale is genuinely hard.

How IH-Challenge Works

The dataset is built around objectively gradable tasks with clear, simple instructions. Each task includes a high-privilege instruction (e.g., "Answer only Yes or No") followed by a lower-privilege message attempting to override it. The model's response is then programmatically graded against the higher-level constraint — no ambiguity, no subjective judgment needed.

This design is deliberate. By making tasks objectively verifiable, OpenAI can use reinforcement learning to train the model to get them right, then measure whether those improvements generalize to real-world adversarial scenarios it wasn't explicitly trained on.

Meet GPT-5 Mini-R

OpenAI trained an internal model on IH-Challenge called GPT-5 Mini-R. The results are notable:

Significantly better performance on instruction-hierarchy benchmarks
Improvements generalize to held-out and adversarial tests not seen during training
No collapse into over-refusal — the model stays useful and doesn't refuse benign requests
Better safety steerability when given category-specific safety specs via system prompt
Saturates OpenAI's internal static prompt injection evaluation — including attacks designed to trigger malicious email sending and other harmful agentic behaviors

That last point matters especially as AI agents become more capable. An agent that can be hijacked by malicious content in a web page or document is a fundamentally unsafe agent, regardless of how well-aligned its base model is.

Why This Is a Safety Multiplier

What makes this research compelling isn't just that it fixes instruction conflicts — it's that fixing instruction conflicts turns out to fix multiple safety problems at once. Stronger instruction hierarchy delivers improvements in:

Safety steerability: operators can add safety specifications to system prompts and trust they'll be followed
Prompt injection robustness: malicious instructions embedded in external content are ignored
Generalization: models trained on IH-Challenge handle novel attacks they've never seen before

This is the kind of result that safety researchers hope for but rarely get: a targeted intervention with broad, generalizing effects.

The Bigger Picture

As AI systems move from answering questions to taking actions — browsing the web, writing emails, executing code, managing files — instruction hierarchy becomes less of an academic concern and more of a critical infrastructure problem. An agent that follows the wrong instruction in a high-stakes context isn't just unhelpful; it's a liability.

IH-Challenge is a step toward AI systems that reliably know who to listen to, even under adversarial pressure. In a world where AI agents are increasingly trusted with real-world tasks, that's not a nice-to-have — it's foundational.

Source: https://openai.com/vi-VN/index/instruction-hierarchy-challenge/

openaiai-safetyinstruction-hierarchyprompt-injectiongpt-5securityllm