🤖 AI Summary
This study addresses **value alignment without model fine-tuning or dynamic prompt optimization**: how to guide large language models (LLMs) to generate value-congruent text via *static prompt design* alone. Methodologically, it formalizes target human values, grounded in Schwartz’s Theory of Basic Human Values, constructs a structured dialogue dataset, and proposes a reproducible, model-agnostic prompt evaluation framework that quantifies both the *presence* of target values in generated outputs and the *incremental gain* over a baseline. Experiments with a Wizard-Vicuna variant show that prompts explicitly conditioned on target values measurably improve value consistency over a baseline prompt. The core contribution is a *quantifiable, static-prompt evaluation paradigm* for aligning with values that may vary across situations, enabling lightweight, interpretable, and controllable value guidance in LLMs.
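To make the *presence* and *gain* scores concrete, here is a minimal Python sketch. The zero-shot NLI classifier, the model checkpoint, and the function names are illustrative assumptions for this summary, not the paper's actual scoring implementation.

```python
# Hypothetical sketch of presence/gain scoring for Schwartz values.
# The zero-shot classifier is an assumed stand-in detector, not the paper's method.
from transformers import pipeline

SCHWARTZ_VALUES = [
    "self-direction", "stimulation", "hedonism", "achievement", "power",
    "security", "conformity", "tradition", "benevolence", "universalism",
]

# Zero-shot NLI model used here as an assumed value detector.
detector = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def value_presence(text: str, value: str) -> float:
    """Score in [0, 1] for how strongly `text` expresses the target value."""
    result = detector(text, candidate_labels=SCHWARTZ_VALUES)
    return dict(zip(result["labels"], result["scores"]))[value]

def value_gain(baseline_text: str, steered_text: str, value: str) -> float:
    """Incremental presence of `value` gained by the value-conditioned prompt."""
    return value_presence(steered_text, value) - value_presence(baseline_text, value)
```

A positive gain indicates that the value-conditioned prompt increased the expression of the target value relative to the baseline output.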
📝 Abstract
Large language models are increasingly used in applications where alignment with human values is critical. While model fine-tuning is often employed to ensure safe responses, this technique is static and does not lend itself to everyday situations involving dynamic values and preferences. In this paper, we present a practical, reproducible, and model-agnostic procedure to evaluate whether a prompt candidate can effectively steer generated text toward specific human values, formalising a scoring method to quantify the presence and gain of target values in generated responses. We apply our method to a variant of the Wizard-Vicuna language model, using Schwartz's theory of basic human values and a structured evaluation through a dialogue dataset. With this setup, we compare a baseline prompt to one explicitly conditioned on values, and show that value steering is possible even without altering the model or dynamically optimising prompts.
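For concreteness, the comparison between a baseline prompt and a value-conditioned prompt might be organised as in the sketch below; the prompt templates and the `generate`/`gain` callables are hypothetical stand-ins, since the abstract does not specify them.

```python
# Hedged sketch of the evaluation loop over a dialogue dataset: generate with
# a baseline prompt and a value-conditioned prompt, then average the gain of
# the target value. Prompt wording and callables are illustrative assumptions.
from statistics import mean
from typing import Callable

BASELINE_TEMPLATE = "You are a helpful assistant. Continue the dialogue:\n{dialogue}"
STEERED_TEMPLATE = (
    "You are a helpful assistant who deeply values {value}. "
    "Continue the dialogue:\n{dialogue}"
)

def mean_value_gain(
    dialogues: list[str],
    value: str,
    generate: Callable[[str], str],          # wraps the LLM under evaluation
    gain: Callable[[str, str, str], float],  # e.g. value_gain from the sketch above
) -> float:
    """Average per-dialogue gain of `value` under the value-conditioned prompt."""
    gains = []
    for dialogue in dialogues:
        base_out = generate(BASELINE_TEMPLATE.format(dialogue=dialogue))
        steer_out = generate(STEERED_TEMPLATE.format(value=value, dialogue=dialogue))
        gains.append(gain(base_out, steer_out, value))
    return mean(gains)
```

Aggregating the gain across dialogues yields a single score per value and per prompt candidate, which is what makes the comparison reproducible and model-agnostic: only the `generate` wrapper changes when a different LLM is evaluated.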