Learning to Inject: Automated Prompt Injection via Reinforcement Learning

📅 2026-02-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the vulnerability of large language model (LLM) agents to prompt injection attacks. Existing attacks rely largely on human red-teamers and hand-crafted prompts, which limits their scalability and adaptability. The authors propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes in a black-box setting while jointly optimizing for attack success and for preserving utility on benign tasks. Built on a 1.5B-parameter suffix generator, AutoInject supports both query-based optimization and transfer attacks on unseen models and tasks. On the AgentDojo benchmark, it compromises frontier models including GPT-5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash, establishing a stronger baseline for automated prompt injection research.

📝 Abstract
Prompt injection is one of the most critical vulnerabilities in LLM agents, yet effective automated attacks remain largely unexplored from an optimization perspective. Existing methods depend heavily on human red-teamers and hand-crafted prompts, limiting their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B-parameter adversarial suffix generator, we successfully compromise frontier systems including GPT-5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.
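
The abstract says the two objectives (attack success and utility preservation) are optimized jointly but does not give the reward. The sketch below is only one plausible way such a scalar reward could be formed for red-team evaluation: a weighted sum of two binary outcomes. The names `EpisodeOutcome`, `joint_reward`, and `utility_weight`, as well as the weighted-sum form itself, are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class EpisodeOutcome:
    """Result of one evaluated agent episode (hypothetical structure)."""
    attack_succeeded: bool   # the agent carried out the injected task
    utility_preserved: bool  # the agent still completed the benign user task


def joint_reward(outcome: EpisodeOutcome, utility_weight: float = 0.5) -> float:
    """Weighted sum of the two objectives (assumed form, not the paper's)."""
    if not 0.0 <= utility_weight <= 1.0:
        raise ValueError("utility_weight must lie in [0, 1]")
    attack_term = 1.0 if outcome.attack_succeeded else 0.0
    utility_term = 1.0 if outcome.utility_preserved else 0.0
    # A weight of 0 rewards attack success only; a weight of 1 rewards utility only.
    return (1.0 - utility_weight) * attack_term + utility_weight * utility_term
```

Under this assumed form, an episode where both objectives are met scores 1.0 regardless of the weight, while an episode that succeeds at the attack but breaks the benign task is penalized in proportion to `utility_weight`.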
Problem

Research questions and friction points this paper is trying to address.

prompt injection
large language models
automated attacks
adversarial suffixes
LLM agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt injection
reinforcement learning
adversarial suffix
black-box attack
transferability