Unsupervised Elicitation of Language Models

📅 2025-06-11
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of obtaining high-quality human supervision for language models with superhuman capabilities, this paper proposes an unsupervised elicitation framework. Its core contribution is the Internal Coherence Maximization (ICM) algorithm, which elicits model capabilities without any external annotations. ICM constructs training signals from the model's own generated labels and combines self-supervised fine-tuning, unsupervised reward modeling, and reinforcement learning as a replacement for conventional RLHF. On benchmarks including GSM8k-verification and TruthfulQA, the approach matches or exceeds training on gold-standard human supervision. A Claude 3.5 Haiku-based assistant and its associated reward model trained this way both outperform their human-supervised counterparts, pointing toward a scalable paradigm for aligning superhuman models without human labels.

๐Ÿ“ Abstract
To steer pretrained language models for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, *without external supervision*. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where LMs' capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.
Problem

Research questions and friction points this paper is trying to address.

Unsupervised steering of superhuman language models
Eliminating reliance on human-generated supervision
Improving model performance beyond human-labeled data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised Internal Coherence Maximization (ICM) algorithm
Fine-tune models on self-generated labels
Outperforms human-supervised training methods
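The idea behind the bullets above can be made concrete with a minimal sketch of an ICM-style search: score a candidate labeling of an unlabeled dataset by how mutually predictable its labels are, subtract a penalty for logical inconsistencies, and optimize the labeling with simulated annealing. Everything below is an illustrative stand-in, not the authors' implementation: the toy `score_fn` replaces the model log-probabilities the paper would use, `constraint_pairs` is a simplified encoding of mutually exclusive claims, and the hyperparameters are arbitrary.

```python
import math
import random

def mutual_predictability(labels, score_fn):
    """Sum, over examples, of how well each label is predicted from the
    remaining labeled examples (toy stand-in for an LM's log-probability)."""
    total = 0.0
    for i in range(len(labels)):
        context = labels[:i] + labels[i + 1:]
        total += score_fn(labels[i], context)
    return total

def inconsistency(labels, constraint_pairs):
    """Count logical-consistency violations: pairs of mutually
    exclusive claims that are both labeled True."""
    return sum(1 for i, j in constraint_pairs if labels[i] and labels[j])

def icm_search(n, score_fn, constraint_pairs, alpha=2.0,
               steps=2000, t0=2.0, seed=0):
    """Simulated-annealing search for the boolean label assignment that
    maximizes (alpha * mutual predictability) - (inconsistency penalty)."""
    rng = random.Random(seed)
    labels = [rng.random() < 0.5 for _ in range(n)]

    def utility(ls):
        return (alpha * mutual_predictability(ls, score_fn)
                - inconsistency(ls, constraint_pairs))

    cur_u = utility(labels)
    best, best_u = labels[:], cur_u
    for step in range(steps):
        temp = t0 / (1 + step)            # simple cooling schedule
        i = rng.randrange(n)
        labels[i] = not labels[i]         # propose flipping one label
        new_u = utility(labels)
        # Accept uphill moves always; downhill moves with annealing probability.
        if new_u >= cur_u or rng.random() < math.exp((new_u - cur_u) / max(temp, 1e-9)):
            cur_u = new_u
            if cur_u > best_u:
                best, best_u = labels[:], cur_u
        else:
            labels[i] = not labels[i]     # reject: undo the flip
    return best
```

As a toy usage: with a `score_fn` that simply rewards `True` labels and one mutual-exclusion constraint between examples 0 and 1, the search settles on labeling everything `True` except one side of the conflicting pair, illustrating the trade-off between predictability and consistency that the search navigates.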