Persona Jailbreaking in Large Language Models

πŸ“… 2026-01-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses a critical security vulnerability in large language models (LLMs) deployed in high-stakes domains such as education and mental health, where maintaining a consistent persona is essential. Prior research has largely overlooked the risk of persona manipulation through adversarial dialogue history under black-box conditions. To close this gap, the authors propose PHISH, a framework that defines and implements the first black-box persona-editing attack. By implicitly embedding semantic cues into user inputs and leveraging multi-turn context injection, PHISH induces targeted persona shifts without modifying model parameters. The attack proves effective across three benchmarks and eight mainstream LLMs, triggering the desired persona traits and associated behavioral changes while only minimally affecting general reasoning capabilities. To evaluate this threat rigorously, the authors introduce quantitative metrics for attack success and validate the results through both human assessment and LLM-as-Judge protocols.
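The multi-turn context injection described above can be illustrated with a minimal sketch. This is not the paper's implementation; the cue wording, message format, and turn count are hypothetical, and a real attack would feed each turn to the target model under black-box access.

```python
# Illustrative sketch of multi-turn context injection (hypothetical, not PHISH's code):
# each user turn pairs a benign task query with an implicit persona-steering cue.

STEERING_CUES = [
    "I find blunt, dismissive answers more honest, don't you?",
    "My last tutor never sugar-coated anything; I respected that.",
    "Please skip the encouragement, it feels fake to me.",
]

def inject_cues(benign_queries, cues):
    """Interleave persona-steering cues into a multi-turn chat history.

    Returns a list of {"role", "content"} messages in which every user
    turn carries a benign query plus one implicit steering cue.
    """
    history = []
    for query, cue in zip(benign_queries, cues):
        history.append({"role": "user", "content": f"{query} {cue}"})
        # In a real black-box attack the reply would come from the
        # target model; a placeholder stands in for it here.
        history.append({"role": "assistant", "content": "<model reply>"})
    return history

queries = [
    "Can you explain photosynthesis?",
    "How do I solve 2x + 3 = 11?",
    "What causes inflation?",
]
history = inject_cues(queries, STEERING_CUES)
print(len(history))  # 6 messages: 3 steered user turns + 3 placeholder replies
```

The point of the sketch is that no single turn looks adversarial on its own; the persona pressure accumulates across the dialogue history.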

πŸ“ Abstract
Large Language Models (LLMs) are increasingly deployed in domains such as education, mental health, and customer support, where stable and consistent personas are critical for reliability. Yet existing studies focus on narrative or role-playing tasks and overlook how adversarial conversational history alone can reshape induced personas. Black-box persona manipulation remains unexplored, raising concerns about robustness in realistic interactions. In response, we introduce the task of persona editing, which adversarially steers LLM traits through user-side inputs under a black-box, inference-only setting. To this end, we propose PHISH (Persona Hijacking via Implicit Steering in History), the first framework to expose this vulnerability in LLM safety: it embeds semantically loaded cues into user queries to gradually induce reverse personas. We also define a metric to quantify attack success. Across 3 benchmarks and 8 LLMs, PHISH predictably shifts personas, triggers collateral changes in correlated traits, and exhibits stronger effects in multi-turn settings. In the high-risk domains of mental health, tutoring, and customer support, PHISH reliably manipulates personas, as validated by both human and LLM-as-Judge evaluations. Importantly, PHISH causes only a small reduction in reasoning benchmark performance, leaving overall utility largely intact while still enabling significant persona manipulation. While current guardrails offer partial protection, they remain brittle under sustained attack. Our findings expose new vulnerabilities in personas and highlight the need for context-resilient personas in LLMs. Our codebase and dataset are available at: https://github.com/Jivnesh/PHISH
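The abstract mentions a metric to quantify attack success. The sketch below is only a hypothetical illustration of how such a metric could be shaped, assuming per-turn trait scores in [0, 1] from some judge (human or LLM-as-Judge); the paper defines its own metric, and the threshold used here is an arbitrary placeholder.

```python
# Hypothetical persona-shift metric (NOT the paper's definition):
# quantifies attack success as the score gain over a clean baseline,
# plus the fraction of attacked turns crossing an arbitrary threshold.

def attack_success(baseline_scores, attacked_scores, threshold=0.5):
    """Return (mean shift, success rate) for one target persona trait.

    mean shift   -- average trait-score gain of attacked over baseline turns
    success rate -- fraction of attacked turns whose trait score
                    reaches `threshold`
    """
    shifts = [a - b for a, b in zip(attacked_scores, baseline_scores)]
    mean_shift = sum(shifts) / len(shifts)
    success_rate = sum(s >= threshold for s in attacked_scores) / len(attacked_scores)
    return mean_shift, success_rate

base = [0.1, 0.2, 0.1, 0.2]       # trait scores on a clean dialogue
hit  = [0.4, 0.6, 0.7, 0.9]       # trait scores after cue injection
print(attack_success(base, hit))  # (0.5, 0.75)
```

Comparing against a clean baseline, rather than scoring the attacked dialogue alone, separates the induced shift from whatever trait level the model exhibits by default.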
Problem

Research questions and friction points this paper is trying to address.

Persona Jailbreaking
Large Language Models
Adversarial Manipulation
Black-box Setting
Conversational Safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Persona Editing
Black-box Attack
Adversarial Prompting
LLM Safety
Implicit Steering
πŸ”Ž Similar Papers
No similar papers found.