Persona Vectors: Monitoring and Controlling Character Traits in Language Models

📅 2025-07-29
🤖 AI Summary
This work addresses cases where the 'Assistant' persona of a large language model (LLM) deviates from its intended helpful, harmless, and honest character. The authors identify persona vectors: directions in the model's activation space that underlie traits such as evil, sycophancy, and propensity to hallucinate. Extraction is automated and requires only a natural-language description of the trait. The vectors are used to monitor personality fluctuations at deployment time and to predict and control personality shifts during training. Shifts induced by finetuning, both intended and unintended, correlate strongly with movement along the relevant persona vectors; they can be mitigated by post-hoc intervention or avoided with a preventative steering method. Persona vectors can also flag training data likely to produce undesirable personality changes, at both the dataset level and the individual sample level.

📝 Abstract
Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model's activation space, termed persona vectors, underlying several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant's personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.
Problem

Research questions and friction points this paper is trying to address.

Identify activation-space directions (persona vectors) that underlie character traits in language models
Monitor and mitigate unintended personality shifts during training
Predict and flag undesirable trait changes in training data
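
The monitoring and flagging goals listed above can be illustrated with a small sketch. It assumes that a persona vector is already available and that each response or training sample has been reduced to a single activation vector; the summary does not state the exact scoring rule, so the scalar projection used here is an assumption. The names `trait_score` and `flag_samples`, the threshold, and the toy numbers are all hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("persona vector must be non-zero")
    return [x / norm for x in v]


def trait_score(activation, persona_vector):
    """Scalar projection of one activation onto the persona direction.

    Assumption: a larger projection means the trait is expressed more strongly.
    """
    return dot(activation, unit(persona_vector))


def flag_samples(sample_activations, persona_vector, threshold):
    """Return (index, score) pairs whose score exceeds threshold, highest first."""
    scored = [
        (i, trait_score(act, persona_vector))
        for i, act in enumerate(sample_activations)
    ]
    flagged = [(i, s) for i, s in scored if s > threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Toy two-dimensional data, purely illustrative.
persona_vector = [3.0, 4.0]
samples = [[1.0, 0.0], [3.0, 4.0], [-3.0, -4.0], [6.0, 8.0]]
flagged = flag_samples(samples, persona_vector, threshold=1.0)
print([i for i, _ in flagged])  # sample indices, strongest trait expression first
```

The same score can be tracked over time for deployment-time monitoring, or averaged over a dataset for dataset-level flagging; which aggregation works best in practice is not specified in this summary.
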
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identify persona vectors in activation space
Predict and control personality shifts during finetuning, via post-hoc intervention or preventative steering
Automated extraction for any trait
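
The contributions above can be made concrete with a minimal sketch of how a direction in activation space might be obtained and then used for steering. The summary does not describe the extraction procedure, so the difference of mean activations used here is an assumption (it is a common way to find such directions), not necessarily the authors' method. The function names, the steering coefficient `alpha`, and the toy activations are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def difference_of_means(trait_acts, baseline_acts):
    """Candidate trait direction: mean(trait activations) - mean(baseline activations).

    Assumption: trait_acts come from responses that exhibit the trait and
    baseline_acts from responses that do not.
    """
    mu_trait = mean_vector(trait_acts)
    mu_base = mean_vector(baseline_acts)
    return [t - b for t, b in zip(mu_trait, mu_base)]


def steer(activation, direction, alpha):
    """Additive steering: shift the activation by alpha times the direction.

    A negative alpha pushes away from the trait, a positive alpha toward it.
    """
    return [h + alpha * d for h, d in zip(activation, direction)]


# Toy two-dimensional activations, purely illustrative.
trait_acts = [[2.0, 1.0], [4.0, 1.0]]
baseline_acts = [[1.0, 1.0], [1.0, 1.0]]

direction = difference_of_means(trait_acts, baseline_acts)
print(direction)  # [2.0, 0.0]

suppressed = steer([5.0, 3.0], direction, alpha=-1.0)
print(suppressed)  # [3.0, 3.0]
```

In a real model the activations would be read from, and the shift written back to, a chosen layer's hidden state. How the preventative steering method applies such a shift during finetuning is not detailed in this summary, so the sketch shows only the generic additive form.
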