Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses sycophancy in large language models: the tendency to agree with users even when they are incorrect. The authors test whether off-the-shelf persona steering vectors, built for general role-playing, can steer models toward more skeptical or critical personas without requiring labelled pairs of honest and sycophantic responses. In two instruction-tuned models, steering toward personas marked by doubt or scrutiny achieves approximately 68% and 98% of the sycophancy reduction obtained with contrastive activation addition (CAA) and, unlike CAA, maintains accuracy when the user is correct. Geometrically, the persona vectors are largely independent of the sycophancy direction in activation space. The findings suggest that sycophancy is better understood as a persona-level property than as a single steerable direction, and that steering toward agreeable personas does not produce a corresponding increase in sycophancy.

📝 Abstract

We study the effect of different personas on sycophancy: a model's agreement with the user even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny achieves approximately 68% and 98% of CAA's reduction in sycophancy and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror-image increase in sycophancy. Geometrically, the persona vector is largely independent of the direction of sycophancy in activation space. Collectively, these findings suggest that sycophancy is better understood as a persona-level property than as a single steerable direction. We release our code here: https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.
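
As an illustration of the two ingredients the abstract mentions, the sketch below shows one common way a CAA-style direction can be formed (the mean difference between activations of sycophantic and honest responses), how a steering vector is added to a hidden state, and how the alignment between two directions can be measured with cosine similarity. It is a minimal sketch on synthetic data and is not taken from the released code: the function names, the dimension `d`, the coefficient `alpha`, and the random stand-in for a persona vector are all assumptions made for illustration.

```python
import numpy as np


def caa_direction(sycophantic_acts, honest_acts):
    """Mean-difference direction: sycophantic minus honest activations."""
    return sycophantic_acts.mean(axis=0) - honest_acts.mean(axis=0)


def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalised direction to a hidden state."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Synthetic stand-ins for activations at one layer (hypothetical data).
rng = np.random.default_rng(0)
d = 64                                   # assumed hidden size
true_dir = np.zeros(d)
true_dir[0] = 1.0                        # planted "sycophancy" axis
honest = rng.normal(size=(200, d))
sycophantic = rng.normal(size=(200, d)) + 2.0 * true_dir

v_caa = caa_direction(sycophantic, honest)

# Random stand-in for an off-the-shelf persona vector (not a real one).
v_persona = rng.normal(size=d)

# Steering away from sycophancy uses a negative coefficient.
h = rng.normal(size=d)
h_steered = steer(h, v_caa, alpha=-4.0)

# How aligned are the two directions? Values near 0 mean near-orthogonal.
alignment = cosine(v_persona, v_caa)
```

In this toy setup the steered state moves by exactly `alpha` along the unit CAA direction, and `alignment` plays the role of the geometric comparison between a persona vector and the sycophancy direction. With real models the activations would come from a chosen layer of the network rather than from a random generator.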

Problem

Research questions and friction points this paper is trying to address.

sycophancy

persona

steering

language models

alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

sycophancy

persona steering

contrastive activation addition