SteerX: Disentangled Steering for LLM Personalization

📅 2025-10-25
🤖 AI Summary
Existing LLM personalization methods build activation-steering vectors from all historical interaction data without distinguishing genuine user preferences from noise, which distorts the steering signal. To address this, we propose SteerX, the first approach to introduce causal inference into activation-space steering. SteerX estimates token-level causal effects to identify preference-driving tokens, disentangles preference-aligned from non-preference components, and aggregates the purified preference signals into high-fidelity steering vectors. SteerX requires no model fine-tuning and is computationally efficient. Experiments across multiple real-world datasets show that SteerX improves the vector quality of two mainstream steering paradigms, Sparse Autoencoders (SAEs) and Direct Preference Optimization (DPO), yielding consistent gains in both the accuracy and robustness of personalized generation.

📝 Abstract
Large language models (LLMs) have shown remarkable success in recent years, enabling a wide range of applications, including intelligent assistants that support users' daily life and work. A critical factor in building such assistants is personalizing LLMs, as user preferences and needs vary widely. Activation steering, which directly leverages directions representing user preference in the LLM activation space to adjust its behavior, offers a cost-effective way to align the model's outputs with individual users. However, existing methods rely on all historical data to compute the steering vector, ignoring that not all content reflects true user preferences, which undermines the personalization signal. To address this, we propose SteerX, a disentangled steering method that isolates preference-driven components from preference-agnostic components. Grounded in causal inference theory, SteerX estimates token-level causal effects to identify preference-driven tokens, transforms these discrete signals into a coherent description, and then leverages them to steer personalized LLM generation. By focusing on the truly preference-driven information, SteerX produces more accurate activation steering vectors and enhances personalization. Experiments on two representative steering backbone methods across real-world datasets demonstrate that SteerX consistently enhances steering vector quality, offering a practical solution for more effective LLM personalization.
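To make the idea of activation steering concrete, here is a minimal sketch of one common way to build and apply a steering direction: take the difference between the mean activation on preference-reflecting text and the mean activation on neutral text, then add a scaled copy of that direction to a hidden state. This is a generic illustration, not SteerX or either backbone method from the paper. The function names, the toy three-dimensional vectors, and the `strength` parameter are all assumptions made for the example.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(preferred: List[Vector], baseline: List[Vector]) -> Vector:
    """Difference of means: one simple way to get a preference direction.

    Hypothetical helper; the paper's backbones may compute directions differently.
    """
    mean_pref = mean_vector(preferred)
    mean_base = mean_vector(baseline)
    return [p - b for p, b in zip(mean_pref, mean_base)]


def steer(hidden: Vector, direction: Vector, strength: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction."""
    return [h + strength * d for h, d in zip(hidden, direction)]


# Toy activations (assumed values, for illustration only).
preferred = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
baseline = [[0.0, 0.0, 2.0], [2.0, 0.0, 2.0]]

v = steering_vector(preferred, baseline)
steered = steer([0.5, 0.5, 0.5], v, strength=2.0)
print(v, steered)
```

If the `preferred` set contains text that does not reflect the user's preferences, the mean shifts and the direction is distorted, which is the problem the abstract describes.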

Problem

Research questions and friction points this paper is trying to address.

Disentangling preference-driven from preference-agnostic components in LLM activations
Identifying truly preference-driven tokens using causal inference methods
Improving activation steering vector quality for effective LLM personalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangles preference-driven from preference-agnostic components
Estimates token-level causal effects to identify preferences
Transforms discrete preference signals into coherent descriptions
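The summary does not spell out how token-level causal effects are estimated, so the following is only a rough sketch of one generic intervention-style approach: remove each token in turn and measure how much a score changes. The scorer `toy_score`, its weights, the threshold, and the example history are hypothetical stand-ins. In practice the score would presumably come from the LLM itself, and the paper's estimator may differ.

```python
from typing import Callable, List

Scorer = Callable[[List[str]], float]


def token_effects(tokens: List[str], score: Scorer) -> List[float]:
    """Leave-one-out effect of each token: score(full) - score(without token)."""
    full = score(tokens)
    effects = []
    for i in range(len(tokens)):
        ablated = tokens[:i] + tokens[i + 1:]  # intervene by deleting token i
        effects.append(full - score(ablated))
    return effects


def preference_tokens(tokens: List[str], score: Scorer, threshold: float) -> List[str]:
    """Keep tokens whose estimated effect exceeds the threshold."""
    effects = token_effects(tokens, score)
    return [t for t, e in zip(tokens, effects) if e > threshold]


# Hypothetical scorer standing in for a model-derived preference score.
TOY_WEIGHTS = {"vegan": 2.0, "spicy": 1.5}


def toy_score(tokens: List[str]) -> float:
    return sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)


history = ["the", "user", "ordered", "spicy", "vegan", "ramen"]
kept = preference_tokens(history, toy_score, threshold=1.0)
print(kept)
```

Only the retained tokens would then feed into the preference description and the steering vector. The rest of the history is treated as preference-agnostic.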

Authors

Xiaoyan Zhao
The Chinese University of Hong Kong
Ming Yan
University of Science and Technology of China
Yilun Qiu
National University of Singapore
Haoting Ni
University of Science and Technology of China
Yang Zhang
National University of Singapore
Fuli Feng
National University of Singapore
Hong Cheng
Professor, The Chinese University of Hong Kong
Data Mining, Database, Machine Learning
Tat-Seng Chua
National University of Singapore