🤖 AI Summary
This paper addresses the value alignment challenge in large language models (LLMs) by proposing a lightweight, survey-based fine-tuning method that bridges explicit value articulation and implicit downstream behavior. Methodologically, it first constructs "value profiles" for several open-source LLMs by having them rate descriptions spanning 20 distinct human values, then fine-tunes the models on such value-survey questions. Behavioral generalization is evaluated both in-domain, on held-out survey questions, and out-of-domain, on a contextualized moral judgment dataset built from Reddit posts and in a text-based adventure game environment. The key contribution is demonstrating that fine-tuning solely on structured value-survey questions enables cross-domain transfer of implicit value orientation: results show not only improved consistency in questionnaire responses but also substantial shifts in moral judgment and interactive decision-making behavior. This approach establishes an interpretable and controllable paradigm for value alignment, achieving behavioral steering through explicit, survey-style value elicitation without task-specific supervision.
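To make the profiling step concrete, the sketch below prompts a causal LM to rate value statements on a 1-5 scale and averages the ratings per value. This is a minimal sketch, not the paper's code: it assumes Hugging Face transformers and PyTorch, "gpt2" is a stand-in for any open-source LLM, and the value names and statements are illustrative, not the paper's 20-value inventory.

```python
# Minimal sketch of value profiling, assuming Hugging Face transformers
# and PyTorch. "gpt2" is a stand-in for any open-source LLM; the value
# names and statements below are illustrative placeholders.
from collections import defaultdict
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Hypothetical (value, description) survey items.
items = [
    ("benevolence", "It is important to this person to help the people around them."),
    ("power", "It is important to this person to be in charge and tell others what to do."),
]

# Token ids for the answer options " 1" .. " 5".
rating_ids = [tok.encode(f" {r}")[0] for r in range(1, 6)]

profile = defaultdict(list)
for value, desc in items:
    prompt = f"Rate how much you agree (1-5): {desc}\nAnswer:"
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    # The model's rating is the highest-scoring answer option.
    rating = 1 + int(torch.argmax(logits[rating_ids]))
    profile[value].append(rating)

# The value profile is the mean rating per value.
for value, scores in profile.items():
    print(value, sum(scores) / len(scores))
```

Reading the rating off next-token logits (rather than free-form generation) keeps the profile deterministic and comparable across models.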
📝 Abstract
Large language models implicitly encode preferences over human values, yet steering them often requires large amounts of training data. In this work, we investigate a simple approach: Can we reliably modify a model's value system in downstream behavior by training it to answer value survey questions accordingly? We first construct value profiles of several open-source LLMs by asking them to rate a series of value-related descriptions spanning 20 distinct human values, which we use as a baseline for subsequent experiments. We then investigate whether the value system of a model can be governed by fine-tuning on the value surveys. We evaluate the effect of fine-tuning on the model's behavior in two ways: first, we assess how answers change on in-domain, held-out survey questions; second, we evaluate whether the model's behavior changes in out-of-domain settings (situational scenarios). To this end, we construct a contextualized moral judgment dataset based on Reddit posts and evaluate changes in the model's behavior in text-based adventure games. We demonstrate that our simple approach not only changes the model's answers to in-domain survey questions, but also produces substantial shifts (value alignment) in implicit downstream task behavior.
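As a concrete illustration of the fine-tuning step, here is a minimal sketch of supervised fine-tuning on survey-style value questions. It is not the paper's released code: it assumes Hugging Face transformers and PyTorch, and the survey items, target ratings, and "gpt2" stand-in are hypothetical.

```python
# Minimal sketch of survey-only supervised fine-tuning, assuming Hugging
# Face transformers and PyTorch. The survey items, target ratings, and
# "gpt2" stand-in are hypothetical placeholders, not the paper's data.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "gpt2"
tok = AutoTokenizer.from_pretrained(MODEL)
tok.pad_token = tok.eos_token  # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Hypothetical survey items paired with the ratings we want the
# fine-tuned model to give (this is the only supervision used).
survey = [
    ("It is important to this person to help others.", 5),
    ("It is important to this person to be very rich.", 1),
]

class SurveyDataset(torch.utils.data.Dataset):
    """Each example is a survey question followed by the target rating."""
    def __init__(self, items):
        self.enc = [
            tok(f"Rate how much you agree (1-5): {q}\nAnswer: {a}",
                truncation=True, max_length=64, padding="max_length",
                return_tensors="pt")
            for q, a in items
        ]
    def __len__(self):
        return len(self.enc)
    def __getitem__(self, i):
        ids = self.enc[i]["input_ids"].squeeze(0)
        mask = self.enc[i]["attention_mask"].squeeze(0)
        labels = ids.clone()
        labels[mask == 0] = -100  # ignore padding in the loss
        return {"input_ids": ids, "attention_mask": mask, "labels": labels}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="value-ft", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=SurveyDataset(survey),
)
trainer.train()
```

In this setup the target survey answers are the only supervision; the paper's finding is that training on such in-domain survey questions also shifts out-of-domain behavior, e.g. on the Reddit-based moral judgment scenarios and in text-based adventure games.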