🤖 AI Summary
This work addresses a limitation of current activation-level steering methods for large language models: they sweep injection strength over a restricted range and in uncalibrated activation-space units, which can miss the best intervention settings for personality traits. The authors propose a psychological steering framework that derives and calibrates residual-stream injections from established psychological scales and sweeps injection strength without a fixed bound, subject to a fluency constraint, to steer the OCEAN personality dimensions. Using the IPIP-NEO-120 inventory, they compare six injection methods, including mean-difference (MD) injection, against Personality Prompting (P²), an established prompting baseline. MD outperforms P² in open-ended generation in 11 of 14 models (by 3.6%–16.4%), and a hybrid of MD and P² outperforms both methods in 13 of 14 models, with gains of up to 21.9% over P² and up to 26.7% over MD. MD injections also behave as approximately linear control knobs, consistent with the Linear Representation Hypothesis.
📝 Abstract
Large language models (LLMs) emulate consistent human-like behavior that can be shaped through activation-level interventions. This paradigm is converging on additive residual-stream injections, which rely on injection-strength sweeps to approximate optimal intervention settings. However, existing methods restrict the search space and sweep in uncalibrated activation-space units, potentially missing optimal intervention conditions. Thus, we introduce a psychological steering framework that performs unbounded, fluency-constrained sweeps in semantically calibrated units. Our method derives and calibrates residual-stream injections using psychological artifacts, and we use the IPIP-NEO-120, which measures the OCEAN personality model, to compare six injection methods. We find that mean-difference (MD) injections outperform Personality Prompting (P$^2$), an established baseline for OCEAN steering, in open-ended generation in 11 of 14 LLMs, with gains of 3.6\% to 16.4\%, overturning prior reports favoring prompting and positioning representation engineering as a new frontier in open-ended psychological steering. Further, we find that a hybrid of P$^2$ and MD injections outperforms both methods in 13 of 14 LLMs, with gains over P$^2$ ranging from 5.6\% to 21.9\% and gains over MD injections ranging from 3.3\% to 26.7\%. Finally, we show that MD injections align with the Linear Representation Hypothesis and provide reliable, approximately linear control knobs for psychological steering. Nevertheless, they also induce OCEAN trait covariance patterns that depart from the Big Two model, suggesting a gap between learned representations and human psychology.
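To make the idea of a mean-difference injection concrete, the following is a minimal sketch on synthetic data, not the authors' implementation. It assumes the steering vector is the difference between the mean activations of high-trait and low-trait examples, and that steering adds a scaled copy of that vector to the residual stream. The function names, the toy dimensions, and the random activations are all hypothetical; the paper's calibration in psychological units and its fluency constraint are not modeled here.

```python
import numpy as np


def mean_difference_vector(high_acts, low_acts):
    """Difference of the two class means, shape (d_model,)."""
    return high_acts.mean(axis=0) - low_acts.mean(axis=0)


def inject(residual, direction, alpha):
    """Additive residual-stream injection: h' = h + alpha * v."""
    return residual + alpha * direction


def projection(residual, direction):
    """Scalar projection of each residual vector onto the unit steering direction."""
    unit = direction / np.linalg.norm(direction)
    return residual @ unit


# Synthetic stand-ins for residual-stream activations (assumption: real
# activations would be collected from a model on trait-labelled text).
rng = np.random.default_rng(0)
d_model = 16
trait_axis = np.zeros(d_model)
trait_axis[0] = 1.0
high_acts = rng.normal(size=(200, d_model)) + 2.0 * trait_axis
low_acts = rng.normal(size=(200, d_model)) - 2.0 * trait_axis

v = mean_difference_vector(high_acts, low_acts)
h = rng.normal(size=(5, d_model))  # residual vectors to be steered

# A small injection-strength sweep; the range here is arbitrary.
for alpha in (0.0, 0.5, 1.0, 2.0):
    steered = inject(h, v, alpha)
    print(alpha, float(projection(steered, v).mean()))
```

In this toy setting the projection onto the steering direction grows by exactly `alpha * ||v||`, which is linear by construction. The abstract's claim is the stronger, empirical one: that measured trait scores respond approximately linearly to injection strength in real models.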