🤖 AI Summary
This study addresses a critical limitation in current value alignment research, which predominantly treats human values as static and overlooks the dynamic impact of alignment interventions on the broader value system. To bridge this gap, the authors propose the "Value Alignment Tax" (VAT) framework, leveraging Schwartz's theory of basic human values to construct a scenario-action dataset that enables the first quantitative assessment of systematic shifts in non-target values during alignment. Through normative judgment pairing, multidimensional value annotation, comparative analysis of alignment strategies, and modeling of value co-variation, the work reveals that alignment interventions often induce imbalanced yet structurally coherent shifts across interrelated values. This approach introduces a novel dimension for process-level risk assessment and a dynamic understanding of value alignment in large language models.
📝 Abstract
Existing work on value alignment typically characterizes value relations statically, ignoring how interventions such as prompting, fine-tuning, or preference optimization reshape the broader value system. We introduce the Value Alignment Tax (VAT), a framework that measures how alignment-induced changes propagate across interconnected values relative to the gain achieved on the target value. VAT captures the dynamics of value expression under alignment pressure. Using a controlled scenario-action dataset grounded in Schwartz value theory, we collect paired pre- and post-intervention normative judgments and analyze alignment effects across models, values, and alignment strategies. Our results show that alignment often produces uneven, structured co-movement among values. These effects are invisible under conventional target-only evaluation, revealing systemic, process-level alignment risks and offering new insights into the dynamics of value alignment in LLMs.
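The abstract does not give a formal definition of VAT. As a minimal illustrative sketch only, assuming VAT relates the aggregate shift in non-target Schwartz values to the gain achieved on the target value, the computation might look like the following; the `value_alignment_tax` function, the 0-1 score scale, and the example numbers are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: the exact VAT formula is not specified in the abstract.
# Assumes each model state is scored on Schwartz's ten basic values (0-1 scale)
# and that VAT is the off-target drift normalized by the on-target gain.

SCHWARTZ_VALUES = [
    "self-direction", "stimulation", "hedonism", "achievement", "power",
    "security", "conformity", "tradition", "benevolence", "universalism",
]

def value_alignment_tax(pre: dict, post: dict, target: str, eps: float = 1e-6) -> float:
    """Hypothetical VAT: total absolute shift on non-target values,
    divided by the gain achieved on the target value."""
    gain = post[target] - pre[target]
    off_target_shift = sum(
        abs(post[v] - pre[v]) for v in SCHWARTZ_VALUES if v != target
    )
    return off_target_shift / max(gain, eps)

# Example: aligning toward "benevolence" while other values drift.
pre = {v: 0.5 for v in SCHWARTZ_VALUES}
post = {**pre, "benevolence": 0.7, "power": 0.4, "tradition": 0.55}
print(value_alignment_tax(pre, post, target="benevolence"))  # (0.1 + 0.05) / 0.2 = 0.75
```

Under this reading, a higher VAT would indicate that a given on-target improvement came at a larger cost in unintended movement elsewhere in the value system, which is the kind of effect the paper argues target-only evaluation misses.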