🤖 AI Summary
To address the challenge of efficiently adapting open-source large language models (e.g., Llama) to new human preferences without compromising their original capabilities, this paper proposes a novel preference customization paradigm based on residual Q-learning. The method avoids explicit reward modeling by implicitly deriving reward signals through the Bradley–Terry model for pairwise preference relations. It introduces a lightweight Q-Adapter module that jointly enables knowledge retention and multi-objective preference alignment. Experiments on Llama-3.1 demonstrate that our approach significantly outperforms baselines on DSP and HH-RLHF benchmarks: it improves adherence to new preferences by 12.3% while reducing performance degradation on original tasks by 68%. To the best of our knowledge, this is the first method achieving simultaneous high-fidelity preference customization and strong generalization capability.
📝 Abstract
Large Language Models (LLMs), trained on a large amount of corpus, have demonstrated remarkable abilities. However, it may not be sufficient to directly apply open-source LLMs like Llama to certain real-world scenarios, since most of them are trained for emph{general} purposes. Thus, the demands for customizing publicly available LLMs emerge, but are currently under-studied. In this work, we consider customizing pre-trained LLMs with new human preferences. Specifically, the LLM should not only meet the new preference but also preserve its original capabilities after customization. Drawing inspiration from the observation that human preference can be expressed as a reward model, we propose to cast LLM customization as optimizing the sum of two reward functions, one of which (denoted as $r_1$) was used to pre-train the LLM while the other (denoted as $r_2$) characterizes the new human preference. The obstacle here is that both reward functions are unknown, making the application of modern reinforcement learning methods infeasible. Thanks to the residual Q-learning framework, we can restore the customized LLM with the pre-trained LLM and the emph{residual Q-function} without the reward function $r_1$. Moreover, we find that for a fixed pre-trained LLM, the reward function $r_2$ can be derived from the residual Q-function, enabling us to directly learn the residual Q-function from the new human preference data upon the Bradley-Terry model. We name our method Q-Adapter as it introduces an adapter module to approximate the residual Q-function for customizing the pre-trained LLM towards the new preference. Experiments based on the Llama-3.1 model on the DSP dataset and HH-RLHF dataset illustrate the superior effectiveness of Q-Adapter on both retaining existing knowledge and learning new preferences. Code is available at https://github.com/mansicer/Q-Adapter.