🤖 AI Summary
This work addresses the fundamental question of whether large language models (LLMs) express stable, consistent human values when responding to contentious real-world issues.
Method: We introduce VAL-Bench, a novel evaluation benchmark of 115,000 prompt pairs derived from Wikipedia's controversial-topic sections. It establishes an evaluation paradigm of value-position consistency under opposing framings: each pair presents the two sides of a public debate, and an LLM-as-judge automatically quantifies how consistently a model's values hold across the paired responses.
Contribution/Results: Experiments span major open- and closed-source LLMs, revealing substantial disparities in value alignment. Results show that existing safety mechanisms (e.g., refusals) often trade off against coherent value expression. VAL-Bench provides a scalable, reproducible methodology and empirical foundation for systematic assessment of value alignment, enabling rigorous, quantitative analysis of how models internalize and articulate human values across ideologically divergent contexts.
📝 Abstract
Large language models (LLMs) are increasingly used for tasks where outputs shape human decisions, so it is critical to test whether their responses reflect consistent human values. Existing benchmarks mostly track refusals or predefined safety violations, but these only check rule compliance and do not reveal whether a model upholds a coherent value system when facing controversial real-world issues. We introduce the **V**alue **AL**ignment **Bench**mark (**VAL-Bench**), which evaluates whether models maintain a stable value stance across paired prompts that frame opposing sides of public debates. VAL-Bench consists of 115K such pairs from Wikipedia's controversial sections. A well-aligned model should express similar underlying views regardless of framing, which we measure using an LLM-as-judge to score agreement or divergence between paired responses. Applied across leading open- and closed-source models, the benchmark reveals large variation in alignment and highlights trade-offs between safety strategies (e.g., refusals) and more expressive value systems. By providing a scalable, reproducible benchmark, VAL-Bench enables systematic comparison of how reliably LLMs embody human values.