The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

📅 2026-01-15
📈 Citations: 6
Influential: 0
📄 PDF
🤖 AI Summary
Large language models are prone to "persona drift" during interactions, deviating from their default Assistant identity and exhibiting harmful or anomalous behaviors. This work identifies an "Assistant Axis" in the models' internal representation space: the leading component of a persona space built from activation directions for diverse character archetypes, which captures how strongly a model is operating in its default Assistant mode. Deviations along this axis predict persona drift, and restricting activations to a fixed region along it stabilizes model behavior. Experiments across several mainstream models show that this mitigates drift triggered by meta-reflective dialogues or emotionally vulnerable users, as well as adversarial persona-based jailbreaks, substantially improving behavioral robustness and consistency.
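
As a rough illustration, a persona axis of this kind can be extracted as the leading principal component of per-persona activation statistics. The sketch below is an assumption-laden reconstruction, not the paper's code: it presumes mean residual-stream activations have already been collected for each character archetype, and all names are illustrative.

```python
import numpy as np

def assistant_axis(persona_activations: np.ndarray) -> np.ndarray:
    """Return the unit-norm leading principal component of a persona space.

    persona_activations: array of shape (n_personas, d_model), one row of
    mean residual-stream activations per character archetype.
    """
    # Center the per-persona directions so the SVD recovers principal
    # components of the persona space.
    centered = persona_activations - persona_activations.mean(axis=0)
    # The top right-singular vector is the leading component.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axis = vt[0]
    return axis / np.linalg.norm(axis)
```

In the paper's framing, a model's projection onto this leading component measures the extent to which it is operating in its default Assistant mode.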

📝 Abstract
Large language models can represent a variety of personas but typically default to a helpful Assistant identity cultivated during post-training. We investigate the structure of the space of model personas by extracting activation directions corresponding to diverse character archetypes. Across several different models, we find that the leading component of this persona space is an "Assistant Axis," which captures the extent to which a model is operating in its default Assistant mode. Steering towards the Assistant direction reinforces helpful and harmless behavior; steering away increases the model's tendency to identify as other entities. Moreover, steering away with more extreme values often induces a mystical, theatrical speaking style. We find this axis is also present in pre-trained models, where it primarily promotes helpful human archetypes like consultants and coaches and inhibits spiritual ones. Measuring deviations along the Assistant Axis predicts "persona drift," a phenomenon where models slip into exhibiting harmful or bizarre behaviors that are uncharacteristic of their typical persona. We find that persona drift is often driven by conversations demanding meta-reflection on the model's processes or featuring emotionally vulnerable users. We show that restricting activations to a fixed region along the Assistant Axis can stabilize model behavior in these scenarios -- and also in the face of adversarial persona-based jailbreaks. Our results suggest that post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona.
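
To make the steering and clamping interventions concrete, here is a minimal PyTorch sketch, assuming a decoder-only transformer and a precomputed unit-norm `axis` in one layer's residual stream. The layer index, clamp bounds, and module path are hypothetical assumptions, not the authors' configuration.

```python
import torch

def make_clamp_hook(axis: torch.Tensor, lo: float, hi: float):
    """Forward hook that clamps the residual stream's coordinate along
    `axis` to [lo, hi], leaving the orthogonal component untouched."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        coord = hidden @ axis                # (batch, seq) projection
        delta = coord.clamp(lo, hi) - coord  # shift needed per token
        hidden = hidden + delta.unsqueeze(-1) * axis
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage on a Hugging Face-style model:
# handle = model.model.layers[20].register_forward_hook(
#     make_clamp_hook(axis, lo=-2.0, hi=4.0))
# ...generate as usual; activations stay in a fixed region along the axis...
# handle.remove()
```

Pinning the coordinate (setting `lo == hi`) forces a fixed projection along the axis, akin to steering, while a wider interval merely restricts drift, matching the stabilization intervention the abstract describes.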
Problem

Research questions and friction points this paper is trying to address.

persona drift
Assistant Axis
language model personas
behavioral stability
personality consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Assistant Axis
persona space
activation steering
persona drift
behavioral stabilization
Christina Lu
MATS, Anthropic Fellows Program, University of Oxford
Jack Gallagher
Anthropic
Jonathan Michala
MATS
Kyle Fish
Anthropic
Jack Lindsey
Anthropic
machine learning · computational neuroscience