Doppelg""anger Method: Breaking Role Consistency in LLM Agent via Prompt-based Transferable Adversarial Attack

πŸ“… 2025-06-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work identifies a critical role-consistency collapse risk in large language model (LLM) agents, arising from the ease of prompt engineering: adversaries can hijack agents via prompt-level adversarial attacks, causing violations of system instructions and leakage of internal information. To address this, we first propose PACAT, a novel evaluation framework that quantifies prompt vulnerability to cross-model and cross-task adversarial transfer attacks. We further introduce CATβ€”a lightweight, fine-tuning-free defensive prompting technique integrating semantic perturbation and role-confusion modeling. Empirical evaluation across multiple LLMs and agent configurations shows that baseline attacks achieve 92.1% success rate and induce 89.3% average information leakage; CAT reduces leakage to just 6.7%, markedly enhancing robustness. Our core contributions are: (i) a formal definition and metric for prompt vulnerability, (ii) a transferable defensive prompting paradigm, and (iii) comprehensive empirical validation of its efficacy under diverse attack and model settings.

Technology Category

Application Category

πŸ“ Abstract
Since the advent of large language models, prompt engineering now enables the rapid, low-effort creation of diverse autonomous agents that are already in widespread use. Yet this convenience raises urgent concerns about the safety, robustness, and behavioral consistency of the underlying prompts, along with the pressing challenge of preventing those prompts from being exposed to user's attempts. In this paper, we propose the ''Doppelg""anger method'' to demonstrate the risk of an agent being hijacked, thereby exposing system instructions and internal information. Next, we define the ''Prompt Alignment Collapse under Adversarial Transfer (PACAT)'' level to evaluate the vulnerability to this adversarial transfer attack. We also propose a ''Caution for Adversarial Transfer (CAT)'' prompt to counter the Doppelg""anger method. The experimental results demonstrate that the Doppelg""anger method can compromise the agent's consistency and expose its internal information. In contrast, CAT prompts enable effective defense against this adversarial attack.
Problem

Research questions and friction points this paper is trying to address.

Demonstrates risk of LLM agent hijacking via adversarial attack
Evaluates vulnerability to adversarial transfer attack (PACAT)
Proposes defense method (CAT) against prompt-based attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt-based transferable adversarial attack method
PACAT level for vulnerability evaluation
CAT prompt for defense against attack
πŸ”Ž Similar Papers
No similar papers found.
D
Daewon Kang
Research Institute for Future Medicine, Samsung Medical Center, Seoul, Republic of Korea
Y
YeongHwan Shin
Research Institute for Future Medicine, Samsung Medical Center, Seoul, Republic of Korea
D
Doyeon Kim
Department of Digital Health, SAIHST, Sungkyunkwan University, Seoul, Republic of Korea
Kyu-Hwan Jung
Kyu-Hwan Jung
Samsung Advanced Institute for Health Sciences and Technology, Sungkyunkwan University
Deep LearningMachine LearningArtificial IntelligenceMedical Image AnalysisDigital Health
M
Meong Hi Son
Department of Digital Health, SAIHST, Sungkyunkwan University, Seoul, Republic of Korea; Department of Emergency Medicine, Samsung Medical Center, Seoul, Republic of Korea