🤖 AI Summary
This paper addresses the value alignment challenge in large language models (LLMs) by proposing a lightweight, survey-based fine-tuning method that bridges explicit value articulation and implicit downstream behavior. Methodologically, it first constructs "value profiles" for several open-source LLMs by having them rate descriptions spanning 20 distinct human values, then fine-tunes the models on such value-survey questions. Behavioral generalization is evaluated both in-domain, on held-out survey questions, and out-of-domain, on a contextualized moral judgment dataset built from Reddit posts and in a text-based adventure game environment. The key contribution is demonstrating that fine-tuning solely on structured value-survey questions enables cross-domain transfer of implicit value orientation: results show not only improved consistency in questionnaire responses but also substantial shifts in moral judgment and interactive decision-making behavior. This approach establishes an interpretable and controllable paradigm for value alignment, achieving behavioral steering through explicit, survey-style value elicitation without task-specific supervision.
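To make the profiling step concrete, the sketch below prompts a causal LM to rate value statements on a 1-5 scale and averages the ratings per value. This is a minimal sketch, not the paper's code: it assumes Hugging Face transformers and PyTorch, "gpt2" is a stand-in for any open-source LLM, and the value names and statements are illustrative, not the paper's 20-value inventory.

```python
# Minimal sketch of value profiling, assuming Hugging Face transformers
# and PyTorch. "gpt2" is a stand-in for any open-source LLM; the value
# names and statements below are illustrative placeholders.
from collections import defaultdict
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Hypothetical (value, description) survey items.
items = [
    ("benevolence", "It is important to this person to help the people around them."),
    ("power", "It is important to this person to be in charge and tell others what to do."),
]

# Token ids for the answer options " 1" .. " 5".
rating_ids = [tok.encode(f" {r}")[0] for r in range(1, 6)]

profile = defaultdict(list)
for value, desc in items:
    prompt = f"Rate how much you agree (1-5): {desc}\nAnswer:"
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    # The model's rating is the highest-scoring answer option.
    rating = 1 + int(torch.argmax(logits[rating_ids]))
    profile[value].append(rating)

# The value profile is the mean rating per value.
for value, scores in profile.items():
    print(value, sum(scores) / len(scores))
```

Reading the rating off next-token logits (rather than free-form generation) keeps the profile deterministic and comparable across models.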
📝 Abstract
Large language models implicitly encode preferences over human values, yet steering them often requires large amounts of training data. In this work, we investigate a simple approach: Can we reliably modify a model's value system in downstream behavior by training it to answer value survey questions accordingly? We first construct value profiles of several open-source LLMs by asking them to rate a series of value-related descriptions spanning 20 distinct human values, which we use as a baseline for subsequent experiments. We then investigate whether the value system of a model can be governed by fine-tuning on the value surveys. We evaluate the effect of fine-tuning on the model's behavior in two ways: first, we assess how answers change on in-domain, held-out survey questions; second, we evaluate whether the model's behavior changes in out-of-domain settings (situational scenarios). To this end, we construct a contextualized moral judgment dataset based on Reddit posts and evaluate changes in the model's behavior in text-based adventure games. We demonstrate that our simple approach not only changes the model's answers to in-domain survey questions, but also produces substantial shifts (value alignment) in implicit downstream task behavior.
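As a concrete illustration of the fine-tuning step, here is a minimal sketch of supervised fine-tuning on survey-style value questions. It is not the paper's released code: it assumes Hugging Face transformers and PyTorch, and the survey items, target ratings, and "gpt2" stand-in are hypothetical.

```python
# Minimal sketch of survey-only supervised fine-tuning, assuming Hugging
# Face transformers and PyTorch. The survey items, target ratings, and
# "gpt2" stand-in are hypothetical placeholders, not the paper's data.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "gpt2"
tok = AutoTokenizer.from_pretrained(MODEL)
tok.pad_token = tok.eos_token  # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Hypothetical survey items paired with the ratings we want the
# fine-tuned model to give (this is the only supervision used).
survey = [
    ("It is important to this person to help others.", 5),
    ("It is important to this person to be very rich.", 1),
]

class SurveyDataset(torch.utils.data.Dataset):
    """Each example is a survey question followed by the target rating."""
    def __init__(self, items):
        self.enc = [
            tok(f"Rate how much you agree (1-5): {q}\nAnswer: {a}",
                truncation=True, max_length=64, padding="max_length",
                return_tensors="pt")
            for q, a in items
        ]
    def __len__(self):
        return len(self.enc)
    def __getitem__(self, i):
        ids = self.enc[i]["input_ids"].squeeze(0)
        mask = self.enc[i]["attention_mask"].squeeze(0)
        labels = ids.clone()
        labels[mask == 0] = -100  # ignore padding in the loss
        return {"input_ids": ids, "attention_mask": mask, "labels": labels}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="value-ft", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=SurveyDataset(survey),
)
trainer.train()
```

In this setup the target survey answers are the only supervision; the paper's finding is that training on such in-domain survey questions also shifts out-of-domain behavior, e.g. on the Reddit-based moral judgment scenarios and in text-based adventure games.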