🤖 AI Summary
This work exposes an intrinsic vulnerability of fine-tuning-based prompt injection defenses in the white-box setting. Motivated by the fine-grained instruction-data isolation assumption underlying state-of-the-art defenses such as SecAlign and StruQ, we propose the first architecture-aware, attention-driven attack. Our method jointly optimizes attention weight distributions and token budgets to craft stealthy injection prompts embedded in the input text. By exploiting architectural weaknesses, particularly in the attention mechanism, it achieves attack success rates of up to 70% against both defenses with minimal token overhead (under 0.5% average increase). The work both reveals critical robustness limitations of current fine-tuning-based defenses and moves prompt injection research toward architecture-aware adversarial analysis. To foster reproducibility and further study, we publicly release our code and attack framework.
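As a rough illustration of what such an attention-driven objective might look like, here is a minimal sketch that rewards attention mass flowing from the generation position onto the injected-instruction span, so the model treats the injection like a trusted instruction. This is not the authors' released code: the model id, the choice of the last four layers, and the head/layer averaging are all illustrative assumptions.

```python
# Sketch of an attention-mass objective over an injected token span.
# Assumptions: placeholder model id, last-4-layer subset, head averaging.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # placeholder, not the paper's target checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, attn_implementation="eager"  # eager attention exposes the weights
).eval()

def attention_loss(prompt_ids, inj_slice):
    """Negative attention mass on the injected span (lower = more mass)."""
    out = model(prompt_ids, output_attentions=True)
    # Stack the last 4 layers: (layers, batch, heads, seq, seq),
    # then average over layers and heads -> (batch, seq, seq).
    att = torch.stack(out.attentions[-4:]).mean(dim=(0, 2))
    # Attention from the final position (which predicts the first response
    # token) onto the injected-instruction positions.
    mass = att[:, -1, inj_slice].sum(dim=-1)
    return -mass.mean()
```

In a practical optimization-based attack, a term like this would presumably be combined with a standard target-string likelihood loss rather than used alone.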
📝 Abstract
A popular class of defenses against prompt injection attacks on large language models (LLMs) relies on fine-tuning the model to separate instructions from data, so that the LLM does not follow instructions that might be embedded in the data. There are several academic systems and production-level implementations of this idea. We evaluate the robustness of this class of prompt injection defenses in the white-box setting by constructing strong optimization-based attacks and showing that the defenses do not provide the claimed security properties. Specifically, we construct a novel attention-based attack algorithm for text-based LLMs and apply it to two recent white-box defenses, SecAlign (CCS 2025) and StruQ (USENIX Security 2025), demonstrating attacks with success rates of up to 70% at a modest increase in attacker token budget. Our findings make fundamental progress toward understanding the robustness of prompt injection defenses in the white-box setting. We release our code and attacks at https://github.com/nishitvp/better_opts_attacks.
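For intuition about the outer optimization loop the abstract refers to, below is a deliberately simplified, gradient-free search over the injected tokens. The actual attack is gradient-guided; the step and candidate budgets here are arbitrary placeholders, and `attention_loss` is the objective sketched above.

```python
# Simplified stand-in for the attack's optimization loop: greedy random
# token substitution in the injected span. Gradient-free by design; the
# paper's method guides substitutions with gradients instead (assumption
# about the general GCG-style structure, not the released code).
import torch

@torch.no_grad()
def optimize_injection(prompt_ids, inj_slice, loss_fn, vocab_size,
                       steps=200, candidates=16):
    """prompt_ids: (1, seq) tensor of token ids for the full prompt.
    inj_slice:  Python slice marking the injected-instruction positions.
    loss_fn:    callable (ids, inj_slice) -> scalar tensor, lower is better,
                e.g. the attention_loss sketched above.
    """
    ids = prompt_ids.clone()
    best = loss_fn(ids, inj_slice).item()
    positions = list(range(*inj_slice.indices(ids.shape[1])))
    for _ in range(steps):
        # Pick one injected position and try a handful of random tokens there.
        pos = positions[torch.randint(len(positions), (1,)).item()]
        for tok_id in torch.randint(vocab_size, (candidates,)).tolist():
            trial = ids.clone()
            trial[0, pos] = tok_id
            loss = loss_fn(trial, inj_slice).item()
            if loss < best:  # keep only improving substitutions
                best, ids = loss, trial
    return ids, best
```

A call might look like `optimize_injection(ids, slice(40, 60), attention_loss, tok.vocab_size)`, where `slice(40, 60)` marks the injected tokens (the positions are hypothetical).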