AI Summary
This study investigates the capability and controllability of large language models (LLMs) in simulating emotions and personality traits during social interaction. To this end, we introduce PsySET, the first comprehensive benchmark for systematically evaluating diverse LLMs and steering strategies (prompt engineering, fine-tuning, and representation-space vector injection) under emotion- and personality-conditioned guidance. Our key findings reveal nonlinear side effects of affective steering: for example, induced joy degrades factual robustness, while anger enhances privacy resistance. We propose the first multidimensional credibility metric spanning safety, truthfulness, fairness, and ethics, uncovering heterogeneous impacts of psychological states on safety risks and bias propagation. Empirical results show that prompt-based steering offers usability but lacks granularity, whereas vector injection enables fine-grained control at a modest cost to output quality.
Abstract
The ability to control LLMs' emulated emotional states and personality traits is essential for enabling rich, human-centered interactions in socially interactive settings. We introduce PsySET, a Psychologically-informed benchmark to evaluate LLM Steering Effectiveness and Trustworthiness across the emotion and personality domains. Our study spans four models from different LLM families paired with various steering strategies, including prompting, fine-tuning, and representation engineering. Our results indicate that prompting is consistently effective but limited in intensity control, whereas vector injections achieve finer controllability while slightly reducing output quality. Moreover, we explore the trustworthiness of steered LLMs by assessing safety, truthfulness, fairness, and ethics, highlighting potential side effects and behavioral shifts. Notably, we observe idiosyncratic effects; for instance, even a positive emotion like joy can degrade robustness to adversarial factuality, lower privacy awareness, and increase preferential bias. Meanwhile, anger predictably elevates toxicity yet strengthens leakage resistance. Our framework establishes the first holistic evaluation of emotion and personality steering, offering insights into its interpretability and reliability for socially interactive applications.
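To make the representation-engineering strategy concrete, the sketch below illustrates one common form of vector injection: derive a steering direction as the difference between mean activations on contrastive (e.g., joyful vs. neutral) prompts, then add a scaled copy of that direction to a layer's hidden states at inference time. This is a minimal, hypothetical NumPy illustration of the general technique, not the paper's actual implementation; the function name, the toy data, and the scaling scheme are assumptions for demonstration.

```python
import numpy as np

def inject_steering_vector(hidden_states, steering_vector, alpha=2.0):
    """Add a scaled affect direction to every token's hidden state.

    hidden_states:   (seq_len, d_model) activations at one layer
    steering_vector: (d_model,) contrastive direction, e.g.
                     mean(joyful activations) - mean(neutral activations)
    alpha:           injection strength; larger values steer harder,
                     typically at some cost to output quality
    """
    v = steering_vector / np.linalg.norm(steering_vector)  # unit direction
    return hidden_states + alpha * v                       # broadcast over tokens

# Toy demo (synthetic activations, not real model states):
rng = np.random.default_rng(0)
d = 8
neutral = rng.normal(size=(16, d))        # activations on neutral prompts
joyful = neutral + 2.0 * np.eye(d)[0]     # same prompts with a "joy" offset
joy_dir = joyful.mean(axis=0) - neutral.mean(axis=0)

h = rng.normal(size=(4, d))               # hidden states for a new prompt
h_steered = inject_steering_vector(h, joy_dir, alpha=2.0)
print(h_steered.shape)  # (4, 8)
```

In practice the same additive edit is applied inside a forward hook on a chosen transformer layer, and `alpha` gives the fine-grained intensity control that prompting lacks.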