Probing Persona-Dependent Preferences in Language Models

📅 2026-05-13

🤖 AI Summary
This study investigates how preferences are represented internally in large language models under different persona assignments, asking whether each persona has its own preference representation or whether one is shared across personas. The authors train linear probes on residual-stream activations of Gemma-3-27B and Qwen-3.5-122B to predict the models' pairwise task choices, and identify a preference vector that tracks those choices across a range of prompts and situations. On Gemma-3-27B, steering along this vector causally controls pairwise choice. The representation is largely shared across personas: a probe trained on the default helpful-assistant persona predicts and steers the choices of qualitatively different personas, including an evil persona whose preferences anti-correlate with the Assistant's.
📝 Abstract
Large language models (LLMs) can be said to have preferences: they reliably pick certain tasks and outputs over others, and preferences shaped by post-training and system prompts appear to shape much of their behaviour. But models can also adopt different personas which have radically different preferences. How is this implemented internally? Does each persona run on its own preference machinery, or is something shared underneath? We train linear probes on residual-stream activations of Gemma-3-27B and Qwen-3.5-122B to predict revealed pairwise task choices, and identify a genuine preference vector: it tracks the model's preferences as they shift across a range of prompts and situations, and on Gemma-3-27B steering along it causally controls pairwise choice. This preference representation is largely shared across personas: a probe trained on the helpful assistant predicts and steers the choices of qualitatively different personas, including an evil persona whose preferences anti-correlate with those of the Assistant.
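To make the probing setup concrete, here is a minimal sketch on synthetic data, not the paper's actual pipeline. It assumes a Bradley-Terry-style probe in which the probability of choosing task A over task B is a sigmoid of a linear readout of the activation difference. The toy activations, the hidden direction `v`, the two persona utility vectors, and all helper names are hypothetical and exist only for illustration. A probe is fit on "assistant" choices and then evaluated, unchanged, on an "evil" persona whose utilities are anti-correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tasks, n_pairs = 64, 200, 2000  # toy sizes, not the paper's

# Hypothetical shared preference direction in a toy "residual stream".
v = rng.normal(size=d)
v /= np.linalg.norm(v)

def activations(utility, noise=0.3):
    """Toy activations: each task's utility under a persona is written along v."""
    return utility[:, None] * v + noise * rng.normal(size=(len(utility), d))

def pairwise_data(acts, utility):
    """Sample task pairs; features are activation differences, labels are noisy choices."""
    i = rng.integers(0, len(utility), n_pairs)
    j = rng.integers(0, len(utility), n_pairs)
    p_choose_i = 1.0 / (1.0 + np.exp(-4.0 * (utility[i] - utility[j])))
    y = (rng.random(n_pairs) < p_choose_i).astype(float)
    return acts[i] - acts[j], y

def train_probe(X, y, lr=0.1, steps=500):
    """Logistic regression without bias, fit by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    """Fraction of pairs where the probe's preferred task matches the choice."""
    return float(np.mean((X @ w > 0) == (y > 0.5)))

# Assumption: the evil persona's utilities are roughly the negated assistant ones.
u_assistant = rng.normal(size=n_tasks)
u_evil = -u_assistant + 0.3 * rng.normal(size=n_tasks)

acts_assistant = activations(u_assistant)
acts_evil = activations(u_evil)

X_train, y_train = pairwise_data(acts_assistant, u_assistant)
X_test, y_test = pairwise_data(acts_assistant, u_assistant)
X_evil, y_evil = pairwise_data(acts_evil, u_evil)

w = train_probe(X_train, y_train)  # trained on assistant choices only
acc_assistant = accuracy(w, X_test, y_test)
acc_evil = accuracy(w, X_evil, y_evil)
cosine = float(w @ v / np.linalg.norm(w))
utility_corr = float(np.corrcoef(u_assistant, u_evil)[0, 1])

print(f"assistant accuracy: {acc_assistant:.2f}")
print(f"evil-persona accuracy (same probe): {acc_evil:.2f}")
print(f"cosine(probe, v): {cosine:.2f}, utility correlation: {utility_corr:.2f}")
```

In this toy the transfer works by construction, because both personas write their utilities along the same direction. The abstract reports that real activations behave in a broadly similar way. Steering is not reproduced here, since it requires intervening on a real model's forward pass.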
Problem

Research questions and friction points this paper is trying to address.

persona
preference
language models
representation
probing
Innovation

Methods, ideas, or system contributions that make the work stand out.

preference vector
linear probing
persona-dependent preferences
residual stream
causal steering