SteerX: Disentangled Steering for LLM Personalization

📅 2025-10-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing LLM personalization methods build activation-steering vectors from all historical interaction data without separating genuine user preferences from noise, which distorts the steering signal. To address this, we propose SteerX, which brings causal inference into activation-space steering. SteerX estimates token-level causal effects to identify preference-driving tokens, disentangles preference-aligned from non-preference components, and aggregates the purified preference signals into a higher-fidelity steering vector. SteerX requires no model fine-tuning and is computationally efficient. Experiments on multiple real-world datasets show that SteerX improves the vector quality of two mainstream steering paradigms, Sparse Autoencoders (SAEs) and Direct Preference Optimization (DPO), with consistent gains in both the accuracy and the robustness of personalized generation.

📝 Abstract

Large language models (LLMs) have shown remarkable success in recent years, enabling a wide range of applications, including intelligent assistants that support users' daily life and work. A critical factor in building such assistants is personalizing LLMs, as user preferences and needs vary widely. Activation steering, which directly leverages directions representing user preference in the LLM activation space to adjust its behavior, offers a cost-effective way to align the model's outputs with individual users. However, existing methods rely on all historical data to compute the steering vector, ignoring that not all content reflects true user preferences, which undermines the personalization signal. To address this, we propose SteerX, a disentangled steering method that isolates preference-driven components from preference-agnostic components. Grounded in causal inference theory, SteerX estimates token-level causal effects to identify preference-driven tokens, transforms these discrete signals into a coherent description, and then leverages them to steer personalized LLM generation. By focusing on the truly preference-driven information, SteerX produces more accurate activation steering vectors and enhances personalization. Experiments on two representative steering backbone methods across real-world datasets demonstrate that SteerX consistently enhances steering vector quality, offering a practical solution for more effective LLM personalization.
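To make the idea of activation steering concrete, here is a minimal sketch in plain Python. It assumes the steering direction is a difference of mean activations between preference-reflecting and baseline examples, which is one common construction and not necessarily what SteerX or its backbones use. The names `mean_vector`, `steering_vector`, `steer`, and the scaling factor `alpha` are hypothetical, and the short lists stand in for real hidden states.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(preferred: List[Vector], baseline: List[Vector]) -> Vector:
    """Assumed construction: mean(preferred) - mean(baseline)."""
    mean_pref = mean_vector(preferred)
    mean_base = mean_vector(baseline)
    return [p - b for p, b in zip(mean_pref, mean_base)]


def steer(hidden: Vector, direction: Vector, alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction, scaled by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations standing in for real model hidden states.
preferred_acts = [[1.0, 2.0], [3.0, 4.0]]
baseline_acts = [[0.0, 0.0], [2.0, 2.0]]

direction = steering_vector(preferred_acts, baseline_acts)
steered = steer([0.0, 0.0], direction, alpha=0.5)
```

In a real model the shift would be applied to the hidden states of a chosen layer during generation. If the preferred examples contain content unrelated to the user's preferences, that content also leaks into `direction`, which is the problem the abstract describes.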

Problem

Research questions and friction points this paper is trying to address.

Disentangling preference-driven from preference-agnostic components in LLM activations

Identifying truly preference-driven tokens using causal inference methods

Improving activation steering vector quality for effective LLM personalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangles preference-driven from preference-agnostic components

Estimates token-level causal effects to identify preferences

Transforms discrete preference signals into coherent descriptions
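The token-level selection step listed above can be illustrated with a rough sketch. This page does not specify SteerX's causal estimator, so the code below swaps in a simple leave-one-out ablation: a token's effect is taken as the drop in a preference score when that token is removed. The scoring function `toy_score`, its `WEIGHTS` table, and the `threshold` parameter are all hypothetical stand-ins; in practice the score would come from a model.

```python
from typing import Callable, Dict, List

Scorer = Callable[[List[str]], float]


def token_effects(tokens: List[str], score: Scorer) -> List[float]:
    """Leave-one-out effect per token: score(all) - score(all without token i)."""
    full = score(tokens)
    return [full - score(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def preference_tokens(tokens: List[str], score: Scorer, threshold: float) -> List[str]:
    """Keep tokens whose estimated effect exceeds the threshold."""
    effects = token_effects(tokens, score)
    return [tok for tok, eff in zip(tokens, effects) if eff > threshold]


# Hypothetical scorer: a fixed lookup table, used only for illustration.
WEIGHTS: Dict[str, float] = {"sci-fi": 1.0, "slow-burn": 0.5}


def toy_score(tokens: List[str]) -> float:
    return sum(WEIGHTS.get(tok, 0.0) for tok in tokens)


history = ["i", "bought", "a", "sci-fi", "novel", "yesterday"]
effects = token_effects(history, toy_score)
selected = preference_tokens(history, toy_score, threshold=0.25)
```

Here only `"sci-fi"` survives the threshold, so only that token would feed the later steps (composing a coherent description and computing the steering vector). The rest of the history is treated as preference-agnostic.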

🔎 Similar Papers

CoS: Enhancing Personalization and Mitigating Bias with Context Steering

2024-05-02, arXiv.org, Citations: 4