🤖 AI Summary
This work identifies a critical role-consistency collapse risk in large language model (LLM) agents, arising from the ease of prompt engineering: adversaries can hijack agents via prompt-level adversarial attacks, causing violations of system instructions and leakage of internal information. To address this, we first propose PACAT, a novel evaluation framework that quantifies prompt vulnerability to cross-model and cross-task adversarial transfer attacks. We further introduce CAT, a lightweight, fine-tuning-free defensive prompting technique integrating semantic perturbation and role-confusion modeling. Empirical evaluation across multiple LLMs and agent configurations shows that baseline attacks achieve a 92.1% success rate and induce 89.3% average information leakage; CAT reduces leakage to just 6.7%, markedly enhancing robustness. Our core contributions are: (i) a formal definition and metric for prompt vulnerability, (ii) a transferable defensive prompting paradigm, and (iii) comprehensive empirical validation of its efficacy under diverse attack and model settings.
📄 Abstract
Since the advent of large language models, prompt engineering has enabled the rapid, low-effort creation of diverse autonomous agents that are already in widespread use. Yet this convenience raises urgent concerns about the safety, robustness, and behavioral consistency of the underlying prompts, along with the pressing challenge of preventing those prompts from being exposed by users' adversarial attempts. In this paper, we propose the "Doppelgänger method" to demonstrate the risk of an agent being hijacked, thereby exposing system instructions and internal information. Next, we define the "Prompt Alignment Collapse under Adversarial Transfer (PACAT)" level to evaluate the vulnerability to this adversarial transfer attack. We also propose a "Caution for Adversarial Transfer (CAT)" prompt to counter the Doppelgänger method. The experimental results demonstrate that the Doppelgänger method can compromise the agent's consistency and expose its internal information. In contrast, CAT prompts enable an effective defense against this adversarial attack.