🤖 AI Summary
This work addresses a limitation of existing test-time alignment methods: they do not exploit the pairwise comparison structure inherent in human preference data, which can lead to suboptimal guidance of large language model outputs. To overcome this, the authors propose Pref-CTRL, the first approach that explicitly incorporates the pairwise structure of preference data into test-time alignment. Pref-CTRL introduces a multi-objective value function to model preference signals more accurately and uses gradient-based hidden-state editing to enable lightweight intervention without fine-tuning. Experiments show that Pref-CTRL outperforms RE-Control on two benchmark datasets and generalizes better to out-of-domain data.
📝 Abstract
Test-time alignment methods offer a promising alternative to fine-tuning by steering the outputs of large language models (LLMs) at inference time with lightweight interventions on their internal representations. A prominent recent approach, RE-Control (Kong et al., 2024), uses an external value function trained over the LLM's hidden states to guide generation via gradient-based editing. While effective, this method overlooks a key characteristic of alignment tasks, namely that they are typically formulated as learning from human preferences between candidate responses. To address this, we propose a novel preference-based training framework, Pref-CTRL, that uses a multi-objective value function to better reflect the structure of preference data. Our approach outperforms RE-Control on two benchmark datasets and shows greater generalization on out-of-domain datasets. Our source code is available at https://github.com/UTS-nlPUG/pref-ctrl.
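To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy linear value function over a hidden-state vector, a single Bradley-Terry style pairwise loss as a stand-in for preference-based training, and plain gradient ascent as the editing step. The abstract does not specify the form of the multi-objective value function, so it is not reproduced here. All names (`value`, `pairwise_loss`, `edit_hidden_state`, `step_size`, `n_steps`) are hypothetical.

```python
import math


def value(w, b, h):
    """Toy linear value function v(h) = w . h + b (an assumption; the
    value functions described in the abstract are trained over LLM hidden states)."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b


def pairwise_loss(w, b, h_chosen, h_rejected):
    """Bradley-Terry style loss: -log sigmoid(v(chosen) - v(rejected)).

    It is small when the preferred response's hidden state scores higher.
    """
    margin = value(w, b, h_chosen) - value(w, b, h_rejected)
    # Numerically stable form of log(1 + exp(-margin)).
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


def edit_hidden_state(w, h, step_size=0.1, n_steps=3):
    """Gradient ascent on v with respect to h.

    For the linear v above, the gradient with respect to h is simply w.
    """
    h = list(h)
    for _ in range(n_steps):
        h = [hi + step_size * wi for hi, wi in zip(h, w)]
    return h


if __name__ == "__main__":
    w, b = [1.0, -2.0], 0.0
    h = [0.5, 0.5]
    h_edited = edit_hidden_state(w, h)
    print("value before edit:", value(w, b, h))
    print("value after edit:", value(w, b, h_edited))
    print("pairwise loss (edited preferred):", pairwise_loss(w, b, h_edited, h))
```

In a real system the value function would be a learned network and the gradient would come from automatic differentiation; the linear case is used here only because its gradient can be written by hand.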