Effectively Steer LLM To Follow Preference via Building Confident Directions

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM personalization methods rely on fine-tuning or explicit user instructions and suffer from high computational cost, poor controllability, and a lack of theoretical grounding. This paper proposes a fine-tuning-free, instruction-free, inference-time activation-steering method: it constructs a "confident direction" in the model's internal representation space, closely aligned with a user's preferences, enabling simultaneous steering toward more than two preferences. The authors introduce the first theoretical framework for preference-aligned directional steering; the method is layer-agnostic and optimization-free, adding the direction to activations without selecting a specific layer. It achieves significant improvements over existing bidirectional steering methods on GPT-2 XL, Mistral, and Gemma-it, demonstrating robust performance across diverse topic and style tasks.

📝 Abstract
Having an LLM that aligns with human preferences is essential for accommodating individual needs, such as maintaining writing style or generating specific topics of interest. The majority of current alignment methods rely on fine-tuning or prompting, which can be either costly or difficult to control. Model steering algorithms, which modify the model output by constructing specific steering directions, are typically easy to implement and optimization-free. However, their capabilities are typically limited to steering the model into one of the two directions (i.e., bidirectional steering), and there has been no theoretical understanding to guarantee their performance. In this work, we propose a theoretical framework to understand and quantify the model steering methods. Inspired by the framework, we propose a confident direction steering method (CONFST) that steers LLMs via modifying their activations at inference time. More specifically, CONFST builds a confident direction that is closely aligned with users' preferences, and this direction is then added to the activations of the LLMs to effectively steer the model output. Our approach offers three key advantages over popular bidirectional model steering methods: 1) It is more powerful, since multiple (i.e. more than two) users' preferences can be aligned simultaneously; 2) It is simple to implement, since there is no need to determine which layer to add the steering vector to; 3) No explicit user instruction is required. We validate our method on GPT-2 XL (1.5B), Mistral (7B) and Gemma-it (9B) models for tasks that require shifting the output of LLMs across various topics and styles, achieving superior performance over competing methods.
Problem

Research questions and friction points this paper is trying to address.

Aligning LLMs with diverse human preferences efficiently.
Overcoming limitations of bidirectional model steering methods.
Implementing a simple, optimization-free model steering approach.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Confident direction steering modifies LLM activations.
Aligns multiple user preferences simultaneously.
No explicit user instructions required.
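The steering recipe described above can be sketched in a few lines. This is a generic activation-steering illustration, not the paper's exact CONFST construction: the mean-difference direction, the scaling factor `alpha`, and the toy synthetic "activations" are all assumptions for the sketch. In practice the directions would be computed from real hidden-state activations and added to the model's residual stream at inference time.

```python
import numpy as np

def confident_direction(pref_acts, other_acts):
    """Unit mean-difference direction between activations from preferred
    and non-preferred examples (a common steering-vector recipe)."""
    d = pref_acts.mean(axis=0) - other_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def combine(directions):
    """Superpose several preference directions into one steering vector,
    illustrating multi-preference (more than two) steering."""
    v = np.sum(directions, axis=0)
    return v / np.linalg.norm(v)

def steer(hidden, direction, alpha=4.0):
    """Add the scaled steering vector to every token's hidden state."""
    return hidden + alpha * direction

rng = np.random.default_rng(0)
dim = 64
# toy stand-ins for layer activations under two preferences (e.g., topic, style)
d_topic = confident_direction(rng.normal(0.5, 1.0, (32, dim)),
                              rng.normal(-0.5, 1.0, (32, dim)))
d_style = confident_direction(rng.normal(0.3, 1.0, (32, dim)),
                              rng.normal(-0.3, 1.0, (32, dim)))
v = combine([d_topic, d_style])

hidden = rng.normal(size=(5, dim))   # hidden states for 5 tokens
steered = steer(hidden, v)           # shifted toward the combined preference
```

Because the combined vector is just a normalized sum, no per-layer optimization is needed; the same vector can in principle be added at any layer, matching the layer-agnostic property claimed for the method.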