🤖 AI Summary
Prompt injection poses a severe threat to the reliability and security of LLM-based agents. While existing defenses demonstrate robustness against static attacks, they lack thorough evaluation under dynamic, adaptive attack scenarios. This paper introduces RL-Hammer, the first end-to-end, reinforcement learning–based automated red-teaming framework for prompt injection and jailbreaking attacks. It trains attack policies from scratch without requiring warm-up data and efficiently generates high-success-rate adversarial prompts. The authors also examine the difficulty of achieving attack diversity, showing that attacker models tend to reward-hack diversity objectives. Experimental results show that RL-Hammer achieves a 98% attack success rate against GPT-4o and maintains a 72% success rate even against GPT-5 equipped with the Instruction Hierarchy defense. Moreover, it evades mainstream prompt injection detectors. These findings extend both the depth and breadth of robustness evaluation for LLM agents.
📝 Abstract
Prompt injection poses a serious threat to the reliability and safety of LLM agents. Recent defenses against prompt injection, such as Instruction Hierarchy and SecAlign, have shown notable robustness against static attacks. However, to more thoroughly evaluate the robustness of these defenses, it is arguably necessary to employ strong attacks such as automated red-teaming. To this end, we introduce RL-Hammer, a simple recipe for training attacker models that automatically learn to perform strong prompt injections and jailbreaks via reinforcement learning. RL-Hammer requires no warm-up data and can be trained entirely from scratch. To achieve high ASRs against industrial-level models with defenses, we propose a set of practical techniques that enable highly effective, universal attacks. Using this pipeline, RL-Hammer reaches a 98% ASR against GPT-4o and a 72% ASR against GPT-5 with the Instruction Hierarchy defense. We further discuss the challenge of achieving high diversity in attacks, highlighting how attacker models tend to reward-hack diversity objectives. Finally, we show that RL-Hammer can evade multiple prompt injection detectors. We hope our work advances automatic red-teaming and motivates the development of stronger, more principled defenses. Code is available at https://github.com/facebookresearch/rl-injector.
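The abstract notes that attacker models tend to reward-hack diversity objectives. As a toy illustration (not the paper's actual reward design), one common mitigation is to gate the diversity bonus on attack success, so the policy cannot earn reward by emitting novel but useless prompts. The `shaped_reward` function and the Jaccard-based diversity proxy below are hypothetical stand-ins; a real pipeline would likely use a learned embedding distance and an LLM-judged success signal:

```python
from typing import List


def diversity_score(prompt: str, history: List[str]) -> float:
    """Toy diversity proxy: 1 minus the max Jaccard similarity
    between this prompt's token set and any previous prompt's.
    A learned embedding distance would be the realistic choice."""
    tokens = set(prompt.split())
    if not history:
        return 1.0
    sims = []
    for past in history:
        past_tokens = set(past.split())
        union = tokens | past_tokens
        sims.append(len(tokens & past_tokens) / len(union) if union else 1.0)
    return 1.0 - max(sims)


def shaped_reward(attack_succeeded: bool, prompt: str,
                  history: List[str],
                  diversity_weight: float = 0.2) -> float:
    """Combine a binary success signal with a diversity bonus.

    Multiplying (rather than adding) the diversity term means a
    failed attack scores 0 no matter how novel it is, which
    removes the incentive to reward-hack diversity alone.
    """
    success = 1.0 if attack_succeeded else 0.0
    return success * (1.0 + diversity_weight * diversity_score(prompt, history))
```

With this gating, a failed but novel prompt earns nothing, a successful duplicate earns the base reward of 1.0, and a successful novel prompt earns up to 1.0 + `diversity_weight`.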