Unsupervised Elicitation of Moral Values from Language Models

📅 2026-01-25
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of aligning language models with diverse human moral values in the absence of human-annotated moral ground-truth data. To this end, the authors apply the Internal Coherence Maximization (ICM) algorithm, which for the first time enables unsupervised automatic labeling and fine-tuning of pretrained language models' moral judgments, without reliance on human annotations or predefined ethical frameworks. Experimental results show that ICM significantly outperforms existing pretrained and chatbot baselines on benchmarks such as Norm Bank and ETHICS. Models fine-tuned on ICM-generated labels perform comparably to, or even surpass, models trained on human annotations, with the largest improvements on justice- and commonsense-related moral dimensions. Furthermore, ICM reduces social-bias error rates by over 50%, particularly along the axes of race, socioeconomic status, and political orientation.

πŸ“ Abstract
As AI systems become pervasive, grounding their behavior in human values is critical. Prior work suggests that language models (LMs) exhibit limited inherent moral reasoning, leading to calls for explicit moral teaching. However, constructing ground-truth data for moral evaluation is difficult given plural frameworks and pervasive biases. We investigate unsupervised elicitation as an alternative, asking whether pretrained (base) LMs possess an intrinsic moral reasoning capability that can be surfaced without human supervision. Using the Internal Coherence Maximization (ICM) algorithm across three benchmark datasets and four LMs, we test whether ICM can reliably label moral judgments, generalize across moral frameworks, and mitigate social bias. Results show that ICM outperforms all pretrained and chatbot baselines on the Norm Bank and ETHICS benchmarks, while fine-tuning on ICM labels performs on par with, or surpasses, fine-tuning on human labels. Across theoretically motivated moral frameworks, ICM yields its largest relative gains on Justice and Commonsense morality. Furthermore, although chatbot LMs exhibit social-bias failure rates comparable to those of their pretrained counterparts, ICM reduces such errors by more than half, with the largest improvements in race, socioeconomic status, and politics. These findings suggest that pretrained LMs possess latent moral reasoning capacities that can be elicited through unsupervised methods like ICM, providing a scalable path for AI alignment.
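The abstract describes ICM only at a high level. The sketch below is a simplified, hypothetical rendering of an ICM-style procedure, assuming (as in the original ICM work) a coherence objective that combines mutual predictability with logical consistency, optimized by a simulated-annealing search over candidate labels. `predict_fn`, `contradiction_pairs`, and the toy data are illustrative stand-ins, not the paper's actual components; in the real algorithm an LM scores each label conditioned on the rest of the labeled set.

```python
import math
import random

def coherence_score(labels, predict_fn, contradiction_pairs, alpha=2.0):
    """Score a full label assignment:
    alpha * mutual predictability - logical-inconsistency violations."""
    # Mutual predictability: how well each item's label is predicted from
    # the others (predict_fn is a stand-in for an LM's probability).
    predictability = sum(predict_fn(i, labels) for i in range(len(labels)))
    # Logical consistency: contradictory claim pairs must not share a label.
    violations = sum(1 for i, j in contradiction_pairs if labels[i] == labels[j])
    return alpha * predictability - violations

def icm_search(n, predict_fn, contradiction_pairs, steps=500, t0=2.0, seed=0):
    """Simulated-annealing search over binary labels maximizing coherence."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n)]
    score = coherence_score(labels, predict_fn, contradiction_pairs)
    for step in range(steps):
        temp = t0 / (1 + step)            # cooling schedule
        i = rng.randrange(n)
        labels[i] ^= 1                    # propose flipping one label
        new = coherence_score(labels, predict_fn, contradiction_pairs)
        if new >= score or rng.random() < math.exp((new - score) / max(temp, 1e-9)):
            score = new                   # accept the proposal
        else:
            labels[i] ^= 1                # revert the flip
    return labels, score

# Toy demo: six "claims"; the stub predictor rewards agreement with a hidden
# feature, and two claim pairs are marked as mutually contradictory.
features = [0, 0, 0, 1, 1, 1]
predict = lambda i, lab: 1.0 if lab[i] == features[i] else 0.0
pairs = [(0, 3), (1, 4)]
labels, score = icm_search(len(features), predict, pairs)
```

The key design point this sketch illustrates is that no gold labels enter the objective: the search is driven entirely by how self-consistent and mutually predictable the assignment is, which is why the method needs neither human annotations nor a predefined ethical framework.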
Problem

Research questions and friction points this paper is trying to address.

moral reasoning
unsupervised elicitation
language models
AI alignment
moral values
Innovation

Methods, ideas, or system contributions that make the work stand out.

unsupervised elicitation
moral reasoning
Internal Coherence Maximization
AI alignment
bias mitigation