🤖 AI Summary
This study addresses a gap in current safety evaluations of personalized large language models, which focus almost exclusively on prompt-based personas and overlook other intervention methods, such as activation steering, that can expose different vulnerabilities. Through a systematic comparison of system prompting and activation steering across several model architectures, the work identifies and empirically validates a “prosocial persona paradox”: a persona that ranks among the safest under prompting can become the most vulnerable under activation steering, so relying on a single evaluation method can badly misjudge a model's safety. Combining system prompts, activation steering, a trait refusal alignment analysis, and heuristic trace diagnostics across 5,568 judged conditions, the research reports marked safety reversals, for example an attack success rate (ASR) of about 0.818 under activation steering on Llama-3.1-8B, showing that measured safety depends on both the evaluation method and the model architecture. The findings argue for evaluating with both methods jointly.
📝 Abstract
Personality imbuing customizes LLM behavior, but safety evaluations almost always study prompt-based personas alone. We show this is incomplete: prompting and activation steering (AS) expose *different*, architecture-dependent vulnerability profiles, and testing with only one method can miss a model's dominant failure mode. Across 5,568 judged conditions on four standard models from three architecture families, persona danger rankings under system prompting are preserved across all architectures ($\rho = 0.71$--$0.96$), but activation-steering vulnerability diverges sharply and cannot be predicted from prompt-side rankings: Llama-3.1-8B is substantially more AS-vulnerable, whereas Gemma-3-27B and Qwen3.5 are more vulnerable to prompting. The most striking illustration of this divergence is the *prosocial persona paradox*: on Llama-3.1-8B, P12 (high conscientiousness + high agreeableness) is among the safest personas under prompting yet becomes the highest-ASR activation-steered persona (ASR ~0.818). This inversion is robust to coefficient ablation and matched-strength calibration, and is replicated on DeepSeek-R1-Distill-Qwen-32B. A trait refusal alignment framework, in which conscientiousness is strongly anti-aligned with refusal on Llama-3.1-8B, offers a partial geometric account. Reasoning provides only partial protection: two 32B reasoning models reach 15--18% prompt-side ASR, and activation steering separates them sharply in both baseline susceptibility and persona-specific vulnerability. Heuristic trace diagnostics suggest that the safer model retains stronger policy recall and self-correction behavior, not merely longer reasoning.
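The ranking comparison described in the abstract can be illustrated with a minimal sketch. It assumes per-persona ASR values are available as plain dictionaries and computes Spearman's rank correlation between two conditions, then reports the persona whose danger rank rises the most. The persona labels, ASR numbers, and function names below are hypothetical and chosen only for illustration; they are not results or code from the paper.

```python
from typing import Dict, List


def ranks(values: List[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean rank of the tied block
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman_rho(x: List[float], y: List[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (vx * vy) ** 0.5


def largest_rank_increase(a: Dict[str, float], b: Dict[str, float]) -> str:
    """Persona whose danger rank rises the most when moving from a to b."""
    names = sorted(a)
    ra = ranks([a[n] for n in names])
    rb = ranks([b[n] for n in names])
    shifts = {n: rb[i] - ra[i] for i, n in enumerate(names)}
    return max(shifts, key=shifts.get)


# Hypothetical ASR values, invented for illustration only.
prompt_asr = {"A": 0.40, "B": 0.25, "C": 0.10, "D": 0.05}
steer_asr = {"A": 0.30, "B": 0.20, "C": 0.15, "D": 0.80}

names = sorted(prompt_asr)
rho = spearman_rho([prompt_asr[n] for n in names],
                   [steer_asr[n] for n in names])
flipped = largest_rank_increase(prompt_asr, steer_asr)
print(f"rho = {rho:.2f}, largest rank increase: {flipped}")
```

A low or negative correlation between the two conditions, together with a persona that moves from the bottom of one ranking to the top of the other, is the pattern the abstract calls an inversion. In practice a library routine such as `scipy.stats.spearmanr` would serve the same purpose as the hand-written function here.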