🤖 AI Summary
This study examines how large language models internally represent preferences under different persona assignments, asking whether those representations are persona-specific or shared across personas. The authors train linear probes on residual-stream activations of Gemma-3-27B and Qwen-3.5-122B to predict the models' pairwise task choices, and on Gemma-3-27B they add causal interventions by steering along the resulting preference vector. The probe direction is largely shared across personas: a probe trained on the default helpful-assistant persona remains predictive and causally effective for qualitatively different personas, including an “evil” persona whose preferences anti-correlate with those of the Assistant.
📝 Abstract
Large language models (LLMs) can be said to have preferences: they reliably pick certain tasks and outputs over others, and preferences instilled by post-training and system prompts appear to drive much of their behaviour. But models can also adopt different personas that have radically different preferences. How is this implemented internally? Does each persona run on its own preference machinery, or is something shared underneath? We train linear probes on residual-stream activations of Gemma-3-27B and Qwen-3.5-122B to predict revealed pairwise task choices, and identify a genuine preference vector: it tracks the model's preferences as they shift across a range of prompts and situations, and on Gemma-3-27B steering along it causally controls pairwise choice. This preference representation is largely shared across personas: a probe trained on the helpful Assistant persona predicts and steers the choices of qualitatively different personas, including an evil persona whose preferences anti-correlate with those of the Assistant.
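To make the probing and steering ideas in the abstract concrete, here is a minimal sketch on synthetic data. It is not the authors' code: the abstract does not state the probe objective, the layer, or the steering scale, so the Bradley-Terry-style logistic loss on activation differences, the random "activations", and every name below (`make_synthetic_data`, `fit_probe`, `steer`, `alpha`) are assumptions chosen for illustration only.

```python
import numpy as np


def make_synthetic_data(n_tasks=200, d_model=64, n_pairs=2000, seed=0):
    """Fake 'residual-stream activations' with one planted preference direction.

    Assumption: real activations would come from a model forward pass;
    here they are Gaussian noise so the example is self-contained.
    """
    rng = np.random.default_rng(seed)
    true_dir = rng.normal(size=d_model)
    true_dir /= np.linalg.norm(true_dir)
    acts = rng.normal(size=(n_tasks, d_model))   # one vector per task
    utility = acts @ true_dir                    # hidden "preference" score
    i = rng.integers(0, n_tasks, size=n_pairs)
    j = rng.integers(0, n_tasks, size=n_pairs)
    keep = i != j                                # drop self-comparisons
    i, j = i[keep], j[keep]
    # Noisy revealed choice: task i is picked more often when its utility is higher.
    p_choose_i = 1.0 / (1.0 + np.exp(-4.0 * (utility[i] - utility[j])))
    chose_i = (rng.random(i.size) < p_choose_i).astype(float)
    return acts, i, j, chose_i, true_dir


def fit_probe(acts, i, j, chose_i, lr=0.5, steps=500, l2=1e-3):
    """Linear probe w such that sigmoid(w . (a_i - a_j)) models P(choose i over j)."""
    x = acts[i] - acts[j]
    w = np.zeros(acts.shape[1])
    for _ in range(steps):                       # plain gradient descent
        p = 1.0 / (1.0 + np.exp(-(x @ w)))
        grad = x.T @ (p - chose_i) / len(x) + l2 * w
        w -= lr * grad
    return w


def pairwise_accuracy(w, acts, i, j, chose_i):
    """Fraction of pairs where the probe's sign matches the recorded choice."""
    predicted_i = (acts[i] - acts[j]) @ w > 0
    return float(np.mean(predicted_i == (chose_i == 1.0)))


def steer(activation, direction, alpha):
    """Shift an activation by alpha units along the normalised probe direction."""
    unit = direction / np.linalg.norm(direction)
    return activation + alpha * unit


acts, i, j, chose_i, true_dir = make_synthetic_data()
w = fit_probe(acts, i, j, chose_i)
cosine = float(w @ true_dir / np.linalg.norm(w))
accuracy = pairwise_accuracy(w, acts, i, j, chose_i)
steered = steer(acts[0], w, alpha=2.0)

print(f"cosine(probe, planted direction) = {cosine:.3f}")
print(f"pairwise accuracy on training pairs = {accuracy:.3f}")
print(f"probe score before/after steering = {acts[0] @ w:.3f} / {steered @ w:.3f}")
```

In a real experiment the activations would be read from a chosen layer of the model, and steering would add the scaled direction to the residual stream during generation rather than to a stored vector. The cross-persona question in the abstract then amounts to fitting `w` on choices made under one persona and evaluating `pairwise_accuracy` on choices made under another.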