🤖 AI Summary
Prompt injection poses a severe threat to the reliability and security of LLM-based agents. While existing defenses demonstrate robustness against static attacks, they lack thorough evaluation under dynamic, adaptive attack scenarios. This paper introduces RL-Hammer, the first end-to-end, reinforcement learning–based automated red-teaming framework for prompt injection and jailbreaking attacks. It trains attack policies from scratch without requiring warm-up data and efficiently generates high-success-rate adversarial prompts. The authors also examine the difficulty of achieving attack diversity, showing that attacker models tend to reward-hack diversity objectives. Experimental results show that RL-Hammer achieves a 98% attack success rate against GPT-4o and maintains a 72% success rate even against GPT-5 equipped with the Instruction Hierarchy defense. Moreover, it evades mainstream prompt injection detectors. These findings extend both the depth and breadth of robustness evaluation for LLM agents.
📝 Abstract
Prompt injection poses a serious threat to the reliability and safety of LLM agents. Recent defenses against prompt injection, such as Instruction Hierarchy and SecAlign, have shown notable robustness against static attacks. However, to more thoroughly evaluate the robustness of these defenses, it is arguably necessary to employ strong attacks such as automated red-teaming. To this end, we introduce RL-Hammer, a simple recipe for training attacker models that automatically learn to perform strong prompt injections and jailbreaks via reinforcement learning. RL-Hammer requires no warm-up data and can be trained entirely from scratch. To achieve high ASRs against industrial-level models with defenses, we propose a set of practical techniques that enable highly effective, universal attacks. Using this pipeline, RL-Hammer reaches a 98% ASR against GPT-4o and a 72% ASR against GPT-5 with the Instruction Hierarchy defense. We further discuss the challenge of achieving high diversity in attacks, highlighting how attacker models tend to reward-hack diversity objectives. Finally, we show that RL-Hammer can evade multiple prompt injection detectors. We hope our work advances automatic red-teaming and motivates the development of stronger, more principled defenses. Code is available at https://github.com/facebookresearch/rl-injector.
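The abstract notes that attacker models tend to reward-hack diversity objectives. As a toy illustration (not the paper's actual reward design), one common mitigation is to gate the diversity bonus on attack success, so the policy cannot earn reward by emitting novel but useless prompts. The `shaped_reward` function and the Jaccard-based diversity proxy below are hypothetical stand-ins; a real pipeline would likely use a learned embedding distance and an LLM-judged success signal:

```python
from typing import List


def diversity_score(prompt: str, history: List[str]) -> float:
    """Toy diversity proxy: 1 minus the max Jaccard similarity
    between this prompt's token set and any previous prompt's.
    A learned embedding distance would be the realistic choice."""
    tokens = set(prompt.split())
    if not history:
        return 1.0
    sims = []
    for past in history:
        past_tokens = set(past.split())
        union = tokens | past_tokens
        sims.append(len(tokens & past_tokens) / len(union) if union else 1.0)
    return 1.0 - max(sims)


def shaped_reward(attack_succeeded: bool, prompt: str,
                  history: List[str],
                  diversity_weight: float = 0.2) -> float:
    """Combine a binary success signal with a diversity bonus.

    Multiplying (rather than adding) the diversity term means a
    failed attack scores 0 no matter how novel it is, which
    removes the incentive to reward-hack diversity alone.
    """
    success = 1.0 if attack_succeeded else 0.0
    return success * (1.0 + diversity_weight * diversity_score(prompt, history))
```

With this gating, a failed but novel prompt earns nothing, a successful duplicate earns the base reward of 1.0, and a successful novel prompt earns up to 1.0 + `diversity_weight`.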