When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes Confidence-Weighted Preference Optimization (CW-PO), a framework designed to reduce reliance on costly human annotations or strong-model APIs when aligning large language models. CW-PO uses weak language models to generate preference labels and dynamically reweights training samples by the weak model's confidence in each label. Notably, the study finds that training on only the weak model's high-confidence samples can outperform training on the full set of human-annotated data. Experiments show that CW-PO combined with just 20% of the original human annotations surpasses standard Direct Preference Optimization (DPO) trained on 100% of the annotations, improving alignment quality while sharply cutting annotation cost.
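
To make the reweighting idea concrete, here is a minimal PyTorch-style sketch of a confidence-weighted DPO objective. The function name, signature, and normalization scheme are illustrative assumptions, not the paper's released code.

```python
# Illustrative sketch of confidence-weighted DPO (CW-PO-style reweighting).
# The weighting function and all names below are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def cw_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                confidence, beta=0.1):
    """Per-sample DPO loss scaled by a weak LLM's confidence in each pair.

    All arguments are 1-D tensors of summed token log-probs per example,
    except `confidence`, a tensor in [0, 1] from the weak annotator.
    """
    # Standard DPO implicit-reward margin between chosen and rejected responses.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    per_sample = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # CW-PO-style step: up-weight pairs the weak annotator is confident about.
    # Normalizing to mean 1 keeps the loss scale comparable to plain DPO.
    weights = confidence / confidence.sum().clamp_min(1e-8) * confidence.numel()
    return (weights * per_sample).mean()
```

With uniform confidence the weights reduce to 1 and the loss recovers standard DPO, which is why the same wrapper could, in principle, sit on top of other preference optimization objectives as the abstract describes.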

📝 Abstract
Preference alignment is an essential step in adapting large language models (LLMs) to human values, but existing approaches typically depend on costly human annotations or large-scale API-based models. We explore whether a weak LLM can instead act as an effective annotator. Surprisingly, we find that selecting only a subset of a weak LLM's highly confident samples leads to substantially better performance than using full human annotations. Building on this insight, we propose Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization objectives. Notably, the model aligned by CW-PO with just 20% of the human annotations outperforms the model trained with 100% of the annotations under standard DPO. These results suggest that weak LLMs, when paired with confidence weighting, can dramatically reduce the cost of preference alignment while even outperforming methods trained on fully human-labeled data.
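
The abstract's filtering step can be sketched in a similar spirit: score each preference pair by how confidently the weak LLM separates the two responses, then keep only the most confident pairs. The scoring rule below (a sigmoid of the log-probability margin under the weak model) and all helper names are assumptions for illustration; it presumes a Hugging Face-style causal LM and tokenizer.

```python
# Hypothetical sketch of using a weak LLM as a preference annotator and
# retaining only its high-confidence pairs; details are assumptions.
import torch

@torch.no_grad()
def weak_preference(weak_model, tokenizer, prompt, resp_a, resp_b):
    """Label a pair with the weak model and report its confidence.

    Confidence here is a sigmoid of the total log-probability margin,
    an assumed proxy; the paper may define confidence differently.
    """
    def total_logprob(text):
        ids = tokenizer(prompt + text, return_tensors="pt").input_ids
        # For HF causal LMs, `loss` is the mean token cross-entropy, so
        # -loss * num_predicted_tokens approximates the sequence log-prob.
        out = weak_model(ids, labels=ids)
        return -out.loss.item() * (ids.shape[1] - 1)

    margin = total_logprob(resp_a) - total_logprob(resp_b)
    p_a = torch.sigmoid(torch.tensor(margin)).item()
    chosen, rejected = (resp_a, resp_b) if p_a >= 0.5 else (resp_b, resp_a)
    return {"chosen": chosen, "rejected": rejected,
            "confidence": max(p_a, 1.0 - p_a)}  # in [0.5, 1.0]

def keep_high_confidence(annotated_pairs, threshold=0.9):
    """Retain only pairs the weak annotator labels with high confidence."""
    return [p for p in annotated_pairs if p["confidence"] >= threshold]
```

The threshold controls how aggressively the weak annotator's labels are filtered, mirroring the paper's observation that a high-confidence subset can beat the full human-labeled set.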
Problem

Research questions and friction points this paper is trying to address.

preference alignment
large language models
human annotations
weak LLMs
cost reduction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Confidence-Weighted Preference Optimization
weak LLMs
preference alignment
confidence weighting
DPO
Amirabbas Afzali
EPFL, Sharif University of Technology

Myeongho Jeon
EPFL, Postdoctoral Researcher

Maria Brbic
EPFL