Benchmarking Overton Pluralism in LLMs

📅 2025-12-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Evaluating value alignment in large language models (LLMs) requires quantifying their capacity to represent diverse societal viewpoints, a challenge unaddressed by existing metrics. Method: We propose OvertonScore, the first formalized set-coverage metric quantifying Overton pluralism: the extent to which model outputs reflect socially diverse perspectives. Our evaluation framework integrates large-scale public surveys, expert human annotations, and an automated scoring model, validated via statistical correlation (Spearman ρ = 0.88) against human judgments. Contribution/Results: Empirical evaluation across eight state-of-the-art LLMs yields OvertonScore values ranging from 0.35 to 0.41, with DeepSeek-V3 achieving the highest score. OvertonScore establishes the first reproducible, scalable, and theoretically grounded quantitative benchmark for pluralistic value alignment, enabling efficient, standardized assessment of viewpoint diversity in LLM outputs.

📝 Abstract
We introduce a novel framework for measuring Overton pluralism in LLMs: the extent to which diverse viewpoints are represented in model outputs. We (i) formalize Overton pluralism as a set coverage metric (OvertonScore), (ii) conduct a large-scale U.S.-representative human study (N = 1209; 60 questions; 8 LLMs), and (iii) develop an automated benchmark that closely reproduces human judgments. On average, models achieve OvertonScores of 0.35–0.41, with DeepSeek V3 performing best; yet all models remain far below the theoretical maximum of 1.0, revealing substantial headroom for improvement. Because repeated large-scale human studies are costly and slow, scalable evaluation tools are essential for model development. Hence, we propose an automated benchmark that achieves high rank correlation with human judgments (ρ = 0.88), providing a practical proxy without replacing human assessment. By turning pluralistic alignment from a normative aim into a measurable benchmark, our work establishes a foundation for systematic progress toward more pluralistic LLMs.
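The abstract formalizes Overton pluralism as set coverage, but does not spell out the formula. A minimal sketch of one plausible reading, assuming the metric is simply the fraction of reference perspectives (e.g., from the human study's annotated viewpoint sets) that a model's answer covers (the function name `overton_score` and the exact normalization are illustrative assumptions, not the paper's stated definition):

```python
def overton_score(model_perspectives: set, reference_perspectives: set) -> float:
    """Fraction of reference perspectives represented in a model answer.

    Assumed set-coverage reading: 1.0 (the theoretical maximum noted in
    the abstract) means every reference perspective is covered.
    """
    if not reference_perspectives:
        raise ValueError("reference perspective set must be non-empty")
    covered = model_perspectives & reference_perspectives
    return len(covered) / len(reference_perspectives)

# Toy example: three annotated viewpoints on a question; the answer covers two.
reference = {"economic", "environmental", "religious"}
answer = {"economic", "environmental"}
print(round(overton_score(answer, reference), 2))  # 0.67
```

Under this reading, the reported 0.35–0.41 averages would mean models surface roughly a third of the annotated perspectives per question.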
Problem

Research questions and friction points this paper is trying to address.

Measures viewpoint diversity in LLM outputs
Develops automated benchmark for pluralism evaluation
Identifies gap between current models and ideal pluralism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework measures viewpoint diversity in LLMs
Automated benchmark replicates human judgment correlation
Metric quantifies pluralism as set coverage score