Overton Pluralistic Reinforcement Learning for Large Language Models

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing alignment paradigms struggle to model the pluralism of human values and to generate diverse perspectives within a single model. The authors propose OP-GRPO, a two-stage implicit pluralism framework: first, a Sentence Transformer-based similarity estimator is trained to assess response coverage; then, Group Relative Policy Optimization is applied with a dual-reward mechanism that balances coverage and uniqueness to refine the policy. This enables a single large language model to produce richly varied and distinctive responses without explicit prompting or a modular architecture. Evaluated on Qwen2.5-3B-Instruct, the method achieves a 37.4% relative accuracy gain on a natural language inference benchmark over a 20B-parameter GPT-OSS baseline and outperforms a modular-architecture baseline by 19.1%, with robustness confirmed via GPT-4.1-as-judge evaluation, establishing a "small models, big perspective coverage" effect.
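
To make the first stage concrete, below is a minimal Python sketch of how a Sentence Transformer similarity estimator could score perspective coverage. The checkpoint name, the max-over-sentences aggregation, and the 0.6 threshold are illustrative assumptions, not the paper's exact procedure; the paper fine-tunes its own estimator for Overton Pluralism tasks.

```python
# Hypothetical sketch: scoring how well a generated response covers a set of
# reference perspectives. The checkpoint and threshold are assumptions.
from sentence_transformers import SentenceTransformer, util

# A fine-tuned Overton Pluralism estimator would replace this general checkpoint.
model = SentenceTransformer("all-MiniLM-L6-v2")

def coverage_score(response_sentences, reference_perspectives, threshold=0.6):
    """Fraction of reference perspectives matched by at least one
    response sentence above a cosine-similarity threshold."""
    resp_emb = model.encode(response_sentences, convert_to_tensor=True)
    ref_emb = model.encode(reference_perspectives, convert_to_tensor=True)
    sims = util.cos_sim(ref_emb, resp_emb)   # [n_perspectives, n_sentences]
    best = sims.max(dim=1).values            # best response sentence per perspective
    return (best >= threshold).float().mean().item()
```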

📝 Abstract
Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism addresses this gap by generating responses with diverse perspectives from a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow consists of two main steps. First, similarity estimator training fine-tunes a Sentence Transformer for Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses. Second, OP-GRPO training incorporates this similarity estimator into a dual-reward system designed to ensure both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. Empirical results demonstrate a "small models, big perspective coverage" effect. The trained Qwen2.5-3B-Instruct model surpasses a 20B GPT-OSS baseline with a 37.4 percent relative accuracy gain on a Natural Language Inference benchmark, and also outperforms a modular architecture baseline with a 19.1 percent relative improvement. Additional evaluations using GPT-4.1 as a large language model judge further confirm the robustness of the approach.
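
For the second stage, the following sketch illustrates the dual-reward idea: each response in a GRPO group is rewarded for covering reference perspectives and penalized for overlapping with its sibling responses, and advantages are then computed group-relatively. The alpha weighting, the pairwise-similarity redundancy penalty, and the within-group standardization are our assumptions about a plausible instantiation, not the paper's published reward.

```python
# Hypothetical sketch of a dual reward (coverage + uniqueness) for GRPO.
# The reward form and alpha weighting are assumptions for illustration.
import torch
from sentence_transformers import SentenceTransformer, util

estimator = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the fine-tuned estimator

def dual_reward(group_responses, reference_perspectives, alpha=0.5):
    """Per-response reward = alpha * coverage - (1 - alpha) * redundancy."""
    emb = estimator.encode(group_responses, convert_to_tensor=True)
    ref = estimator.encode(reference_perspectives, convert_to_tensor=True)
    coverage = util.cos_sim(emb, ref).max(dim=1).values  # best-matched perspective per response
    pairwise = util.cos_sim(emb, emb)
    pairwise.fill_diagonal_(0.0)                         # ignore self-similarity
    redundancy = pairwise.max(dim=1).values              # closest sibling response
    return alpha * coverage - (1.0 - alpha) * redundancy

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantage: standardize rewards within the sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

In this framing, a response earns a high reward only if it matches some human perspective well while staying dissimilar from the other responses sampled for the same query, which is one way to operationalize the coverage-plus-uniqueness objective the abstract describes.
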
Problem

Research questions and friction points this paper is trying to address.

pluralism
value alignment
large language models
diverse perspectives
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pluralistic Reinforcement Learning
OP-GRPO
Value Alignment
Diversity in LLMs
Similarity Estimator