🤖 AI Summary
Existing LLM personalization methods build activation-steering vectors from all historical interaction data without distinguishing genuine user preferences from noise, which distorts the steering signal. To address this, we propose SteerX, the first approach to introduce causal inference into activation-space steering. SteerX estimates token-level causal effects to identify preference-driven tokens, separating the preference-driven content of a user's history from the preference-agnostic content, and uses only the preference-driven signal to construct higher-fidelity steering vectors. SteerX requires no model fine-tuning and is computationally efficient. Experiments on multiple real-world datasets show that SteerX improves the vector quality of two mainstream steering paradigms, one based on Sparse Autoencoders (SAEs) and one based on Direct Preference Optimization (DPO), with consistent gains in the accuracy and robustness of personalized generation.
📝 Abstract
Large language models (LLMs) have shown remarkable success in recent years, enabling a wide range of applications, including intelligent assistants that support users' daily life and work. A critical factor in building such assistants is personalizing LLMs, as user preferences and needs vary widely. Activation steering, which directly leverages directions representing user preference in the LLM activation space to adjust its behavior, offers a cost-effective way to align the model's outputs with individual users. However, existing methods rely on all historical data to compute the steering vector, ignoring that not all content reflects true user preferences, which undermines the personalization signal. To address this, we propose SteerX, a disentangled steering method that isolates preference-driven components from preference-agnostic components. Grounded in causal inference theory, SteerX estimates token-level causal effects to identify preference-driven tokens, transforms these discrete signals into a coherent description, and then leverages them to steer personalized LLM generation. By focusing on the truly preference-driven information, SteerX produces more accurate activation steering vectors and enhances personalization. Experiments on two representative steering backbone methods across real-world datasets demonstrate that SteerX consistently enhances steering vector quality, offering a practical solution for more effective LLM personalization.
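The general idea of filtering tokens by an estimated effect before building a steering vector can be illustrated with a toy NumPy sketch. This is not the SteerX implementation: the effect estimate (a per-token log-probability gain when the preference context is present), the threshold, the mean-difference vector, and all function names are assumptions made for illustration, and the step that turns selected tokens into a coherent description is omitted.

```python
import numpy as np

def token_effects(logp_with_pref, logp_without_pref):
    # Assumed proxy for a token-level effect: how much more likely each token
    # becomes when the user's preference context is included.
    return np.asarray(logp_with_pref, dtype=float) - np.asarray(logp_without_pref, dtype=float)

def select_preference_tokens(effects, threshold=0.5):
    # Keep indices of tokens whose estimated effect exceeds a (hypothetical) threshold.
    return np.flatnonzero(np.asarray(effects) > threshold)

def steering_vector(token_acts, selected):
    # Mean activation of the selected tokens minus the mean over all tokens
    # (a simple mean-difference direction, chosen here for illustration).
    token_acts = np.asarray(token_acts, dtype=float)
    return token_acts[selected].mean(axis=0) - token_acts.mean(axis=0)

def steer(hidden, vector, alpha=1.0):
    # Shift a hidden state along the steering direction with strength alpha.
    return np.asarray(hidden, dtype=float) + alpha * vector

# Toy data: 4 tokens, 3-dimensional activations.
logp_with = [-1.0, -0.2, -3.0, -0.5]
logp_without = [-1.1, -2.0, -3.0, -2.5]
acts = [[0, 0, 0], [2, 0, 0], [0, 0, 0], [2, 2, 0]]

effects = token_effects(logp_with, logp_without)
selected = select_preference_tokens(effects)
vec = steering_vector(acts, selected)
steered = steer(np.zeros(3), vec, alpha=2.0)
```

In this toy setup only the second and fourth tokens pass the threshold, so the steering direction is driven by their activations rather than by the whole history.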