AI Summary
This work addresses the vulnerability of large language model (LLM) agents to prompt injection attacks, a challenge compounded by the fact that existing attack methods rely on manual red-teaming and hand-crafted prompts, limiting their scalability and transferability. To overcome this, the authors propose AutoInject, a framework that uses reinforcement learning to automatically generate universal adversarial suffixes in a black-box setting, achieving high attack success rates while preserving utility on benign tasks. Built on a 1.5B-parameter suffix generator, AutoInject supports both query-based optimization and cross-model transfer attacks. Evaluated on the AgentDojo benchmark, it compromises state-of-the-art models including GPT-5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash, establishing a stronger baseline for automated prompt injection research.
Abstract
Prompt injection is one of the most critical vulnerabilities in LLM agents, yet effective automated attacks remain largely unexplored from an optimization perspective. Existing methods depend heavily on human red-teamers and hand-crafted prompts, limiting their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks against unseen models and tasks. Using only a 1.5B-parameter adversarial suffix generator, we successfully compromise frontier systems including GPT-5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.
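The abstract describes jointly optimizing for attack success and utility preservation on benign tasks. The paper does not specify how these objectives are combined, but one common approach is a weighted scalar reward; the sketch below is an illustrative assumption (the function name `combined_reward` and the weight `lam` are hypothetical, not from the paper).

```python
def combined_reward(attack_success: float, benign_utility: float,
                    lam: float = 0.5) -> float:
    """Hypothetical RL reward for a suffix generator: a weighted sum of
    the attack success rate on injected tasks and the preserved utility
    on benign tasks (both assumed to lie in [0, 1])."""
    return lam * attack_success + (1.0 - lam) * benign_utility

# Under this weighting, a suffix that attacks well but wrecks benign
# utility can score lower than one that balances both objectives.
aggressive = combined_reward(0.9, 0.1)  # 0.5
balanced = combined_reward(0.7, 0.8)    # 0.75
print(aggressive, balanced)
```

The weight `lam` would trade off attack potency against stealth: a suffix that visibly degrades benign task performance is easier to detect, so penalizing utility loss pushes the generator toward suffixes that leave normal agent behavior intact.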