Exploring Chain-of-Thought Reasoning for Steerable Pluralistic Alignment

📅 2025-10-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are typically trained toward a relatively uniform set of values, which limits their usefulness on tasks that require understanding nuanced human perspectives. This work investigates whether chain-of-thought (CoT) reasoning can enable steerable pluralistic alignment: the capacity to adopt a specified value perspective and align generated outputs with it. The authors compare four approaches: CoT prompting, fine-tuning on human-authored CoT, fine-tuning on synthetic explanations, and Reinforcement Learning with Verifiable Rewards (RLVR). On the Value Kaleidoscope and OpinionQA datasets, RLVR consistently outperforms the other methods and shows strong training-sample efficiency. The generated CoT traces are further analyzed for faithfulness and safety.

📝 Abstract
Large Language Models (LLMs) are typically trained to reflect a relatively uniform set of values, which limits their applicability to tasks that require understanding of nuanced human perspectives. Recent research has underscored the importance of enabling LLMs to support steerable pluralism -- the capacity to adopt a specific perspective and align generated outputs with it. In this work, we investigate whether Chain-of-Thought (CoT) reasoning techniques can be applied to building steerable pluralistic models. We explore several methods, including CoT prompting, fine-tuning on human-authored CoT, fine-tuning on synthetic explanations, and Reinforcement Learning with Verifiable Rewards (RLVR). We evaluate these approaches using the Value Kaleidoscope and OpinionQA datasets. Among the methods studied, RLVR consistently outperforms others and demonstrates strong training sample efficiency. We further analyze the generated CoT traces with respect to faithfulness and safety.
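The simplest approach the abstract lists, CoT prompting, conditions the model on a target perspective and asks for step-by-step reasoning before an answer. A minimal sketch of such a prompt builder follows; the template wording and function name are illustrative assumptions, not the authors' exact prompt.

```python
# Hypothetical perspective-conditioned CoT prompt, in the spirit of the
# paper's CoT-prompting baseline. Wording is an assumption for illustration.

def build_steerable_cot_prompt(situation: str, value: str) -> str:
    """Ask the model to reason step by step from a given value perspective."""
    return (
        f"Consider the following situation: {situation}\n"
        f"Adopt the perspective of someone who holds the value: {value}.\n"
        "Think step by step about how this value applies, then state whether\n"
        "this perspective supports or opposes the action.\n"
        "Reasoning:"
    )

prompt = build_steerable_cot_prompt(
    "Lying to a friend to spare their feelings.",
    "Honesty",
)
```

In the steerable-pluralism setting, the same situation is paired with different values to elicit different, perspective-consistent judgments.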
Problem

Research questions and friction points this paper is trying to address.

Enabling LLMs to support nuanced human perspectives
Applying Chain-of-Thought reasoning for steerable pluralism
Aligning generated outputs with specific value perspectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using Chain-of-Thought reasoning for pluralistic alignment
Applying Reinforcement Learning with Verifiable Rewards method
Fine-tuning models on human-authored CoT and on synthetic explanations
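In RLVR, the reward is "verifiable" because the target value judgment is given by a dataset label (e.g., whether a value supports or opposes an action in Value Kaleidoscope), so a simple programmatic check can score each sampled completion. A minimal sketch of such a reward function, assuming a hypothetical `Answer:` tag convention for the model's final judgment (not the paper's actual format):

```python
import re

def verifiable_reward(completion: str, gold_label: str) -> float:
    """Return 1.0 if the completion's final answer matches the gold label.

    Assumes the model ends its CoT with a line like 'Answer: supports';
    this tag convention is an illustrative assumption, not the paper's format.
    """
    match = re.search(r"Answer:\s*(\w+)", completion, re.IGNORECASE)
    if match is None:
        return 0.0  # no parseable final answer -> no reward
    return 1.0 if match.group(1).lower() == gold_label.lower() else 0.0
```

Rewards of this form would then drive a standard policy-gradient RL loop; the binary signal requires no learned reward model, which is one plausible reason for the sample efficiency the summary reports.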