Pref-CTRL: Preference Driven LLM Alignment using Representation Editing

📅 2026-04-26

🤖 AI Summary
Existing test-time alignment methods do not exploit the pairwise comparison structure of human preference data, which can lead to suboptimal guidance of large language model outputs. The authors propose Pref-CTRL, described as the first approach to explicitly incorporate this pairwise structure into test-time alignment. Pref-CTRL trains a multi-objective value function to model preference signals more accurately and applies gradient-based hidden-state editing at inference time, so the LLM itself needs no fine-tuning. In the reported experiments, Pref-CTRL outperforms RE-Control on two benchmark datasets and generalizes better to out-of-domain data.
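Gradient-based hidden-state editing can be illustrated with a toy example. The sketch below is not the paper's implementation: the linear-plus-sigmoid value function, the names `value`, `value_grad`, and `edit_hidden_state`, and the step size and step count are all assumptions chosen for illustration. It shows only the general idea of nudging a hidden state along the gradient of a value function while the model weights stay untouched.

```python
import math

def value(h, w, b):
    """Hypothetical value function: sigmoid of a linear score over a hidden state."""
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    return 1.0 / (1.0 + math.exp(-z))

def value_grad(h, w, b):
    """Gradient of the toy value function with respect to the hidden state h."""
    v = value(h, w, b)
    return [v * (1.0 - v) * wi for wi in w]

def edit_hidden_state(h, w, b, step_size=0.5, n_steps=5):
    """Take a few gradient-ascent steps on h to raise its predicted value.

    step_size and n_steps are illustrative defaults, not values from the paper.
    """
    edited = list(h)
    for _ in range(n_steps):
        g = value_grad(edited, w, b)
        edited = [hi + step_size * gi for hi, gi in zip(edited, g)]
    return edited

# Toy hidden state and value-function parameters (made up for the example).
h = [0.2, -0.4, 0.1]
w = [1.0, -2.0, 0.5]
b = 0.0
h_edited = edit_hidden_state(h, w, b)
```

In a real system the hidden state would come from the LLM's forward pass and the value function would be a trained network; only the edited representation is passed on to produce the next token.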

📝 Abstract
Test-time alignment methods offer a promising alternative to fine-tuning by steering the outputs of large language models (LLMs) at inference time with lightweight interventions on their internal representations. A prominent recent approach, RE-Control (Kong et al., 2024), leverages an external value function trained over the LLM's hidden states to guide generation via gradient-based editing. While effective, this method overlooks a key characteristic of alignment tasks, namely that they are typically formulated as learning from human preferences between candidate responses. To address this, we propose a novel preference-based training framework, Pref-CTRL, that uses a multi-objective value function to better reflect the structure of preference data. Our approach outperforms RE-Control on two benchmark datasets and generalizes better to out-of-domain datasets. Our source code is available at https://github.com/UTS-nlPUG/pref-ctrl.
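To make "learning from human preferences between candidate responses" concrete, the sketch below shows a standard Bradley-Terry style pairwise loss over two scalar value estimates. This is an assumption used for illustration: the abstract does not specify the objectives inside Pref-CTRL's multi-objective value function, and the function name `pairwise_preference_loss` is hypothetical.

```python
import math

def pairwise_preference_loss(v_chosen, v_rejected):
    """Bradley-Terry style loss: -log(sigmoid(v_chosen - v_rejected)).

    Illustrative only; not claimed to be the objective used by Pref-CTRL.
    The two branches are algebraically equal and avoid overflow in exp().
    """
    margin = v_chosen - v_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss is small when the preferred response scores higher, large otherwise.
loss_correct_order = pairwise_preference_loss(2.0, 0.0)
loss_wrong_order = pairwise_preference_loss(0.0, 2.0)
```

A loss of this form depends only on the difference between the two scores, which is what distinguishes pairwise training from fitting each response's value independently.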
Problem

Research questions and friction points this paper is trying to address.

LLM alignment
human preferences
test-time alignment
preference-based training
representation editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Preference-based Alignment
Representation Editing
Test-time Control
Multi-objective Value Function
Large Language Models