🤖 AI Summary
This work identifies a critical role-consistency collapse risk in large language model (LLM) agents, arising from the ease of prompt engineering: adversaries can hijack agents via prompt-level adversarial attacks, causing violations of system instructions and leakage of internal information. To address this, we first propose PACAT, a novel evaluation framework that quantifies prompt vulnerability to cross-model and cross-task adversarial transfer attacks. We further introduce CAT, a lightweight, fine-tuning-free defensive prompting technique integrating semantic perturbation and role-confusion modeling. Empirical evaluation across multiple LLMs and agent configurations shows that baseline attacks achieve a 92.1% success rate and induce 89.3% average information leakage; CAT reduces leakage to just 6.7%, markedly enhancing robustness. Our core contributions are: (i) a formal definition and metric for prompt vulnerability, (ii) a transferable defensive prompting paradigm, and (iii) comprehensive empirical validation of its efficacy under diverse attack and model settings.
📄 Abstract
Since the advent of large language models, prompt engineering has enabled the rapid, low-effort creation of diverse autonomous agents that are already in widespread use. Yet this convenience raises urgent concerns about the safety, robustness, and behavioral consistency of the underlying prompts, along with the pressing challenge of preventing those prompts from being exposed by users' adversarial attempts. In this paper, we propose the "Doppelgänger method" to demonstrate the risk of an agent being hijacked, thereby exposing system instructions and internal information. Next, we define the "Prompt Alignment Collapse under Adversarial Transfer (PACAT)" level to evaluate the vulnerability to this adversarial transfer attack. We also propose a "Caution for Adversarial Transfer (CAT)" prompt to counter the Doppelgänger method. The experimental results demonstrate that the Doppelgänger method can compromise the agent's consistency and expose its internal information. In contrast, CAT prompts enable an effective defense against this adversarial attack.