When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes Confidence-Weighted Preference Optimization (CW-PO), a framework designed to reduce reliance on costly human annotations or strong-model APIs when aligning large language models. CW-PO uses weak language models to generate preference labels and dynamically reweights training samples by the weak model's confidence in each label. Notably, the study finds that training on only the weak model's high-confidence samples can outperform training on the full set of human-annotated data. Experiments show that CW-PO combined with just 20% of the original human annotations surpasses standard Direct Preference Optimization (DPO) trained on 100% of the annotations, improving alignment quality while sharply cutting annotation cost.
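
To make the reweighting idea concrete, here is a minimal PyTorch-style sketch of a confidence-weighted DPO objective. The function name, signature, and normalization scheme are illustrative assumptions, not the paper's released code.

```python
# Illustrative sketch of confidence-weighted DPO (CW-PO-style reweighting).
# The weighting function and all names below are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def cw_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                confidence, beta=0.1):
    """Per-sample DPO loss scaled by a weak LLM's confidence in each pair.

    All arguments are 1-D tensors of summed token log-probs per example,
    except `confidence`, a tensor in [0, 1] from the weak annotator.
    """
    # Standard DPO implicit-reward margin between chosen and rejected responses.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    per_sample = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # CW-PO-style step: up-weight pairs the weak annotator is confident about.
    # Normalizing to mean 1 keeps the loss scale comparable to plain DPO.
    weights = confidence / confidence.sum().clamp_min(1e-8) * confidence.numel()
    return (weights * per_sample).mean()
```

With uniform confidence the weights reduce to 1 and the loss recovers standard DPO, which is why the same wrapper could, in principle, sit on top of other preference optimization objectives as the abstract describes.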

📝 Abstract
Preference alignment is an essential step in adapting large language models (LLMs) to human values, but existing approaches typically depend on costly human annotations or large-scale API-based models. We explore whether a weak LLM can instead act as an effective annotator. Surprisingly, we find that selecting only a subset of a weak LLM's highly confident samples leads to substantially better performance than using full human annotations. Building on this insight, we propose Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization objectives. Notably, the model aligned by CW-PO with just 20% of the human annotations outperforms the model trained with 100% of the annotations under standard DPO. These results suggest that weak LLMs, when paired with confidence weighting, can dramatically reduce the cost of preference alignment while even outperforming methods trained on fully human-labeled data.
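
The abstract's filtering step can be sketched in a similar spirit: score each preference pair by how confidently the weak LLM separates the two responses, then keep only the most confident pairs. The scoring rule below (a sigmoid of the log-probability margin under the weak model) and all helper names are assumptions for illustration; it presumes a Hugging Face-style causal LM and tokenizer.

```python
# Hypothetical sketch of using a weak LLM as a preference annotator and
# retaining only its high-confidence pairs; details are assumptions.
import torch

@torch.no_grad()
def weak_preference(weak_model, tokenizer, prompt, resp_a, resp_b):
    """Label a pair with the weak model and report its confidence.

    Confidence here is a sigmoid of the total log-probability margin,
    an assumed proxy; the paper may define confidence differently.
    """
    def total_logprob(text):
        ids = tokenizer(prompt + text, return_tensors="pt").input_ids
        # For HF causal LMs, `loss` is the mean token cross-entropy, so
        # -loss * num_predicted_tokens approximates the sequence log-prob.
        out = weak_model(ids, labels=ids)
        return -out.loss.item() * (ids.shape[1] - 1)

    margin = total_logprob(resp_a) - total_logprob(resp_b)
    p_a = torch.sigmoid(torch.tensor(margin)).item()
    chosen, rejected = (resp_a, resp_b) if p_a >= 0.5 else (resp_b, resp_a)
    return {"chosen": chosen, "rejected": rejected,
            "confidence": max(p_a, 1.0 - p_a)}  # in [0.5, 1.0]

def keep_high_confidence(annotated_pairs, threshold=0.9):
    """Retain only pairs the weak annotator labels with high confidence."""
    return [p for p in annotated_pairs if p["confidence"] >= threshold]
```

The threshold controls how aggressively the weak annotator's labels are filtered, mirroring the paper's observation that a high-confidence subset can beat the full human-labeled set.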
Problem

Research questions and friction points this paper is trying to address.

preference alignment
large language models
human annotations
weak LLMs
cost reduction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Confidence-Weighted Preference Optimization
weak LLMs
preference alignment
confidence weighting
DPO
Amirabbas Afzali
EPFL, Sharif University of Technology

Myeongho Jeon
EPFL, Postdoctoral Researcher

Maria Brbic
EPFL