🤖 AI Summary
This work investigates the fundamental limitations of ordinal preference feedback (e.g., pairwise comparisons) for optimizing large language model outputs on tasks whose evaluation inherently requires human feedback, such as deep research or travel planning. In the first formal integration of social choice theory (particularly voting theory) into RLHF analysis, combined with reinforcement learning theory and generalization bounds from preference learning, we rigorously prove that even under ideal conditions (infinite data, zero noise, and online preference acquisition), post-training based solely on ordinal feedback cannot guarantee convergence to an approximately optimal policy. We further disentangle the distinct failure modes of reasoning-oriented settings versus instruction tuning, exposing an inherent unreliability in ordinal feedback. Our core contribution is establishing a theoretical bottleneck for RLHF in complex reasoning tasks and showing that overcoming it requires incorporating absolute (cardinal) scoring mechanisms and designing novel algorithms grounded in richer feedback structures.
📝 Abstract
Recent progress in strengthening the capabilities of large language models has stemmed from applying reinforcement learning to domains with automatically verifiable outcomes. A key question is whether we can similarly use RL to optimize for outcomes in domains where evaluating outcomes inherently requires human feedback; for example, in tasks like deep research and trip planning, outcome evaluation is qualitative and there are many possible degrees of success. One attractive and scalable modality for collecting human feedback is preference data: ordinal rankings (pairwise or $k$-wise) that indicate, for $k$ given outcomes, which one is preferred. In this work, we study a critical roadblock: preference data fundamentally and significantly limits outcome-based optimization. Even with idealized preference data (infinite, noiseless, and online), the use of ordinal feedback can prevent obtaining even approximately optimal solutions. We formalize this impossibility using voting theory, drawing an analogy between how a model chooses to answer a query and how voters choose a candidate to elect. This indicates that grounded human scoring and algorithmic innovations are necessary for extending the success of RL post-training to domains demanding human feedback. We also explore why these limitations have disproportionately impacted RLHF when it comes to eliciting reasoning behaviors (e.g., backtracking) versus situations where RLHF has been historically successful (e.g., instruction-tuning and safety training), finding that the limitations of preference data primarily suppress RLHF's ability to elicit robust strategies -- a class that encompasses most reasoning behaviors.
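The voting-theory analogy can be made concrete with a classic Condorcet cycle. The sketch below is purely illustrative (it is not one of the paper's own constructions): three hypothetical annotators each rank three candidate outputs, every pairwise comparison has a clear majority winner, and yet the aggregate preference is intransitive, so pairwise ordinal data alone singles out no best output.

```python
from itertools import combinations

# Three candidate outputs and three hypothetical annotator rankings
# (illustrative profiles, not taken from the paper).
rankings = [
    ["A", "B", "C"],  # annotator 1: A > B > C
    ["B", "C", "A"],  # annotator 2: B > C > A
    ["C", "A", "B"],  # annotator 3: C > A > B
]

def pairwise_winner(x, y):
    """Majority winner of the pairwise comparison x vs. y."""
    x_wins = sum(r.index(x) < r.index(y) for r in rankings)
    return x if x_wins > len(rankings) / 2 else y

for x, y in combinations("ABC", 2):
    print(f"{x} vs {y}: majority prefers {pairwise_winner(x, y)}")
# A beats B, B beats C, yet C beats A: an intransitive cycle,
# so no ranking of the three outputs is consistent with all
# majority pairwise preferences.
```

This is exactly the failure mode the abstract gestures at: a learner observing only (even noiseless, infinite) pairwise preferences can be driven in circles, because the preferences it aggregates need not admit any optimal answer.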