Users as Annotators: LLM Preference Learning from Comparison Mode

📅 2025-10-10
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses the challenge of inconsistent preference annotation quality from users during LLM comparison interactions, which severely hampers alignment performance. We propose a dynamic quality-aware preference learning framework grounded in user behavioral modeling. Our method introduces (1) an asymmetric dual-model response generation mechanism to faithfully emulate real-world comparative decision-making; (2) a latent variable to represent annotation quality, jointly estimated with the preference model via the Expectation-Maximization (EM) algorithm; and (3) fully automated, label-free data filtering that adapts to intrinsic annotation reliability. Experiments demonstrate substantial improvements in preference data fidelity and alignment robustness, outperforming state-of-the-art methods across multiple alignment benchmarks—particularly under scenarios with noisy or low-quality user feedback. The approach establishes a novel paradigm for robust preference-based alignment without requiring manual quality labels.
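The summary names the latent quality variable and the EM estimation but not the functional form of the behavior model. As a loud assumption, a minimal one-coin (Dawid–Skene-style) formulation that fits the asymmetric dual-model setup could look as follows; $z_i$, $p$, $q_u$, and $y_i$ are illustrative symbols, not the paper's notation:

```latex
% Illustrative one-coin user behavior model (an assumption, not the paper's exact formulation).
% z_i: latent indicator that the stronger model's response truly wins pair i;
% q_u: latent annotation quality of user u; y_i: the user's observed binary label.
\begin{align}
  z_i &\sim \mathrm{Bernoulli}(p), \\
  \Pr\big(y_i = z_i \mid q_{u(i)}\big) &= q_{u(i)}, \\
  \Pr\big(y_i = 1 \mid p, q_{u(i)}\big) &= p\, q_{u(i)} + (1 - p)\big(1 - q_{u(i)}\big).
\end{align}
```

One reading of why the asymmetry matters under this toy model: if the two responses came from exchangeable models, $p = 1/2$ and the marginal above equals $1/2$ regardless of $q_u$, leaving the quality factor unidentifiable from labels alone.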

📝 Abstract
Pairwise preference data have played an important role in the alignment of large language models (LLMs). Each sample of such data consists of a prompt, two different responses to the prompt, and a binary label indicating which of the two responses is better. The labels are usually annotated by professional human annotators. In this paper, we consider an alternative approach to collecting pairwise preference data: user annotation from comparison mode. With the increasingly wide adoption of LLMs among the population, users are contributing more and more preference labels through their daily interactions with the LLMs. The upside of such labels is that users are the best experts in judging the responses to their own queries/prompts; the downside is the lack of quality control over these labels. We propose generating the two responses from two different models, or from two different versions of the same model. The asymmetry allows us to infer the quality of a user's data through our proposed user behavior model. We develop an expectation-maximization (EM) algorithm to estimate a latent quality factor for each user and filter users' annotation data accordingly. Downstream tasks demonstrate the effectiveness of our approach both in capturing user behavior and in filtering data for LLM alignment.
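The abstract does not spell out the EM updates. Under the illustrative one-coin model sketched above, a minimal implementation could look like this; the `records` format, function name, and defaults are assumptions, not the paper's interface:

```python
# Minimal EM sketch for estimating per-user latent annotation quality under
# a one-coin noise model (an assumption; the paper's actual user behavior
# model may differ). Asymmetry: responses come from a stronger and a weaker
# model, so the stronger model's prior win rate p is informative and is
# estimated jointly with the per-user quality factors.
from collections import defaultdict

def em_quality(records, n_iters=50, p=0.7, q0=0.8):
    """records: list of (user_id, label) pairs; label = 1 if the user
    preferred the stronger model's response. Returns (p, {user: q})."""
    quality = defaultdict(lambda: q0)  # q_u: P(user u labels correctly)
    for _ in range(n_iters):
        # E-step: posterior gamma_i = P(stronger model truly better | label).
        gammas = []
        for user, label in records:
            q = quality[user]
            like1 = q if label == 1 else 1.0 - q   # P(y | z = 1, q)
            like0 = 1.0 - q if label == 1 else q   # P(y | z = 0, q)
            gammas.append(p * like1 / (p * like1 + (1.0 - p) * like0))
        # M-step: update the prior win rate and each user's quality,
        # clamping away from {0, 1} to keep later E-steps well defined.
        p = min(max(sum(gammas) / len(gammas), 1e-3), 1.0 - 1e-3)
        num, den = defaultdict(float), defaultdict(float)
        for (user, label), g in zip(records, gammas):
            num[user] += g if label == 1 else 1.0 - g  # E[label is correct]
            den[user] += 1.0
        for user in den:
            quality[user] = min(max(num[user] / den[user], 1e-3), 1.0 - 1e-3)
    return p, dict(quality)
```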
Problem

Research questions and friction points this paper is trying to address.

Collecting pairwise preference data through user annotations
Assessing data quality from user-provided comparison labels
Filtering user annotations for effective LLM alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Collects pairwise preference data from user annotations
Infers user data quality via behavior modeling
Filters annotations using estimated latent quality factors (see the sketch after this list)
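A toy usage of the EM sketch above as a filtering step; the threshold `tau` and the sample `records` are made up for illustration, not values from the paper:

```python
# Illustrative filtering: keep annotations only from users whose estimated
# latent quality clears a threshold. em_quality is the sketch above.
records = [("u1", 1), ("u1", 1), ("u2", 0), ("u2", 1), ("u3", 1)]
p_hat, q_hat = em_quality(records)
tau = 0.75  # made-up quality threshold
filtered = [(u, y) for u, y in records if q_hat[u] >= tau]
print(f"estimated stronger-model win rate: {p_hat:.2f}")
print(f"kept {len(filtered)} of {len(records)} annotations")
```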
Zhongze Cai
Imperial College Business School, Imperial College London
Xiaocheng Li
Imperial College Business School, Imperial College London
Machine learning · Operations research