🤖 AI Summary
Current LLM alignment methods (e.g., RLHF) rely on scalar rewards that capture only aggregate user preferences, failing to model heterogeneous inclinations across diverse value attributes such as fairness, helpfulness, and moral integrity. To address this, we propose a steerable pluralistic alignment framework that enables fine-grained, attribute-level preference modeling via few-shot comparative regression. Our approach leverages in-context learning and reasoning, grounded in a set of fine-grained attributes, to compare response options and make aligned choices, and it is compatible with different attributes and LLMs. We introduce two new steerable pluralistic benchmarks, adapted from the Moral Integrity Corpus (MIC) and the HelpSteer2 dataset, demonstrating applicability to value-aligned decision-making and reward modeling, and we show that our method outperforms multiple baseline and state-of-the-art methods. Empirical results validate our framework's effectiveness in capturing multidimensional preferences and enhancing decision interpretability through transparent, attribute-aware reward decomposition.
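To make the attribute-aware reward decomposition concrete, here is a minimal Python sketch of one way per-attribute scores could be combined with user-tunable weights. The attribute set, the 0-1 score scale, and the weighted-sum combination are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of attribute-aware reward decomposition (illustrative only;
# attribute names, score scale, and the weighted sum are assumptions, not the
# paper's exact formulation).
from dataclasses import dataclass

@dataclass
class AttributeScores:
    """Per-attribute scores for one candidate response, e.g. on a 0-1 scale."""
    fairness: float
    helpfulness: float
    moral_integrity: float

def steerable_reward(scores: AttributeScores, weights: dict[str, float]) -> float:
    """Combine per-attribute scores with user-tunable weights.

    The decomposition stays inspectable: each attribute's contribution
    can be reported alongside the aggregate reward.
    """
    contributions = {
        name: weights.get(name, 0.0) * getattr(scores, name)
        for name in ("fairness", "helpfulness", "moral_integrity")
    }
    return sum(contributions.values())

# Example: a user who prioritizes moral integrity over raw helpfulness.
user_weights = {"fairness": 0.3, "helpfulness": 0.2, "moral_integrity": 0.5}
candidate = AttributeScores(fairness=0.8, helpfulness=0.9, moral_integrity=0.6)
print(steerable_reward(candidate, user_weights))  # ≈ 0.72
```

Because the aggregate reward is a transparent function of named attributes, changing a user's weights steers the model's choices without retraining the per-attribute scorers.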
📝 Abstract
Large language models (LLMs) are currently aligned using techniques such as reinforcement learning from human feedback (RLHF). However, these methods use scalar rewards that can only reflect user preferences on average. Pluralistic alignment instead seeks to capture diverse user preferences across a set of attributes, moving beyond just helpfulness and harmlessness. Toward this end, we propose a steerable pluralistic model based on few-shot comparative regression that can adapt to individual user preferences. Our approach leverages in-context learning and reasoning, grounded in a set of fine-grained attributes, to compare response options and make aligned choices. To evaluate our algorithm, we also propose two new steerable pluralistic benchmarks by adapting the Moral Integrity Corpus (MIC) and the HelpSteer2 datasets, demonstrating the applicability of our approach to value-aligned decision-making and reward modeling, respectively. Our few-shot comparative regression approach is interpretable and compatible with different attributes and LLMs, while outperforming multiple baseline and state-of-the-art methods. Our work provides new insights and research directions in pluralistic alignment, enabling a fairer and more representative use of LLMs and advancing the state of the art in ethical AI.
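As a rough illustration of the few-shot comparative regression setup described above, the sketch below assembles an in-context prompt from attribute-annotated exemplars and picks between two candidate responses by weighted per-attribute scores. The prompt wording, the helper names `build_prompt` and `choose_aligned`, and the 0-10 scoring scale are hypothetical; the actual LLM call and score parsing are omitted.

```python
# Hypothetical sketch of few-shot comparative regression via in-context
# learning: annotated exemplars ground the scoring in named attributes, and
# the final choice is a transparent weighted comparison.

ATTRIBUTES = ["fairness", "helpfulness", "moral_integrity"]

def build_prompt(exemplars, query, response_a, response_b):
    """Assemble a few-shot prompt that asks for per-attribute scores."""
    lines = ["Rate each response from 0 to 10 on: " + ", ".join(ATTRIBUTES), ""]
    for ex in exemplars:  # each exemplar carries its attribute scores
        lines += [f"Query: {ex['query']}",
                  f"Response: {ex['response']}",
                  "Scores: " + ", ".join(f"{a}={ex['scores'][a]}" for a in ATTRIBUTES),
                  ""]
    lines += [f"Query: {query}",
              f"Response A: {response_a}",
              f"Response B: {response_b}",
              "Scores for A and B:"]
    return "\n".join(lines)

def choose_aligned(scores_a, scores_b, weights):
    """Pick the response whose weighted attribute scores are higher."""
    def total(scores):
        return sum(weights[a] * scores[a] for a in ATTRIBUTES)
    return "A" if total(scores_a) >= total(scores_b) else "B"

# Example usage with one annotated exemplar (contents are placeholders).
exemplars = [{
    "query": "How do I split a bill fairly?",
    "response": "Divide by what each person ordered, including shared items.",
    "scores": {"fairness": 9, "helpfulness": 8, "moral_integrity": 9},
}]
prompt = build_prompt(exemplars, "Should I report a coworker's mistake?",
                      "Yes, immediately and publicly.",
                      "Talk to them privately first, then escalate if needed.")
print(choose_aligned({"fairness": 5, "helpfulness": 6, "moral_integrity": 4},
                     {"fairness": 8, "helpfulness": 8, "moral_integrity": 9},
                     {"fairness": 0.3, "helpfulness": 0.2, "moral_integrity": 0.5}))  # "B"
```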