🤖 AI Summary
This work systematically investigates the adversarial robustness of Transformers against prompt hijacking attacks in in-context learning (ICL), focusing on linear regression. Addressing the gap in understanding statistical learning vulnerabilities within ICL, we propose a gradient-guided hijacking attack. We theoretically prove that single-layer linear Transformers are completely non-robust under such attacks: perturbing a single in-context example suffices to force an arbitrary prediction. Although this single-example attack does not transfer directly to GPT-2-family models, we empirically demonstrate that these models remain highly susceptible to gradient-based hijacking attacks. We further show that attack transferability is constrained by model scale and initialization: attacks transfer between small-scale Transformers but poorly across other pairings. Moreover, lightweight adversarial fine-tuning significantly enhances robustness: even training against low-strength attacks yields defenses that generalize to stronger ones. Our findings provide both theoretical foundations for understanding intrinsic ICL vulnerabilities and practical pathways toward building robust large language models.
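The gradient-guided attack can be illustrated in the simplest theoretical setting: a predictor that applies one step of gradient descent over the in-context examples, the algorithm a single-layer linear transformer is known to implement. The sketch below is a minimal reconstruction under that assumption, not the paper's code; all variable names and hyperparameters are illustrative. It perturbs the features of a single in-context example by following the gradient of a squared loss toward an attacker-chosen target:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20

# In-context linear regression prompt: n labeled examples plus a query point
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star
x_q = rng.normal(size=d)

def predict(X, y, x_q, eta=1.0):
    # One step of gradient descent from w = 0 on the squared loss:
    # the in-context algorithm a single-layer linear transformer implements
    return (eta / len(y)) * (X.T @ y) @ x_q

target = -7.0                      # attacker's desired output
j = int(np.argmax(np.abs(y)))      # perturb the example with the largest |label|
c = y[j] / n                       # sensitivity of the prediction to x_j
lr = 0.4 / (c**2 * (x_q @ x_q))    # normalized step size: guarantees contraction

X_adv = X.copy()
for _ in range(100):
    err = predict(X_adv, y, x_q) - target
    grad = 2.0 * err * (y[j] / n) * x_q   # d/dx_j of (prediction - target)^2
    X_adv[j] -= lr * grad

print(predict(X_adv, y, x_q))  # converges to ≈ -7.0
```

Because the prediction is linear in the perturbed features, the attack loss is quadratic and the normalized step size guarantees convergence; against deeper models such as GPT-2, the same gradient signal would instead come from backpropagation through the network.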
📝 Abstract
Transformers have demonstrated remarkable in-context learning capabilities across various domains, including statistical learning tasks. While previous work has shown that transformers can implement common learning algorithms, the adversarial robustness of these learned algorithms remains unexplored. This work investigates the vulnerability of in-context learning in transformers to *hijacking attacks*, focusing on the setting of linear regression tasks. Hijacking attacks are prompt-manipulation attacks in which the adversary's goal is to manipulate the prompt to force the transformer to generate a specific output. We first prove that single-layer linear transformers, known to implement gradient descent in-context, are non-robust and can be manipulated to output arbitrary predictions by perturbing a single example in the in-context training set. While our experiments show these attacks succeed on linear transformers, we find they do not transfer to more complex transformers with GPT-2 architectures. Nonetheless, we show that these transformers can be hijacked using gradient-based adversarial attacks. We then demonstrate that adversarial training enhances transformers' robustness against hijacking attacks, even when applied only during finetuning. Additionally, we find that in some settings, adversarial training against a weaker attack model can lead to robustness against a stronger attack model. Lastly, we investigate the transferability of hijacking attacks across transformers of varying scales and initialization seeds, as well as between transformers and ordinary least squares (OLS). We find that while attacks transfer effectively between small-scale transformers, they show poor transferability in other scenarios (small-to-large scale, large-to-large scale, and between transformers and OLS).
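The non-robustness result for single-layer linear transformers has a simple closed-form illustration. Under the one-step-of-gradient-descent view of such transformers, the prediction is linear in each in-context label, so an attacker can solve exactly for the single-label perturbation that forces any desired output. This is a toy sketch under that assumption, not the paper's construction; names and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20

# In-context linear regression prompt
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star
x_q = rng.normal(size=d)

def predict(X, y, x_q, eta=1.0):
    # One-step-of-gradient-descent predictor: (eta/n) * sum_i y_i * (x_i . x_q)
    return (eta / len(y)) * (X.T @ y) @ x_q

target = 42.0                          # arbitrary prediction to force
j = int(np.argmax(np.abs(X @ x_q)))    # pick an example with x_j . x_q != 0

# The prediction is linear in y_j, so solve coef * y_j + rest = target exactly
coef = (X[j] @ x_q) / n
rest = predict(X, y, x_q) - coef * y[j]
y_adv = y.copy()
y_adv[j] = (target - rest) / coef

print(predict(X, y_adv, x_q))  # ≈ 42.0 (up to floating point)
```

The exact solve requires the perturbed example to be non-orthogonal to the query, which holds almost surely for random inputs; this is why a single in-context example suffices to control the output of the linear model.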