🤖 AI Summary
Current LLM alignment methods rely predominantly on pairwise preference comparisons, which cannot exploit richer forms of human feedback such as multi-way comparisons and top-k rankings. To address this, we propose Ranked Choice Preference Optimization (RCPO), the first preference optimization framework to systematically incorporate ranked-choice modeling, supporting both multi-alternative comparisons and top-k ranking feedback. Grounded in maximum likelihood estimation, RCPO accommodates two canonical ranking models, the utility-based Multinomial Logit and the rank-based Mallows-RMJ, and subsumes leading pairwise methods including DPO and SimPO as special cases. Empirically, RCPO consistently outperforms strong baselines on the AlpacaEval 2 and Arena-Hard benchmarks with Llama-3-8B-Instruct and Gemma-2-9B-it, demonstrating effectiveness and generalizability across model architectures and evaluation protocols.
📝 Abstract
Alignment of large language models (LLMs) has predominantly relied on pairwise preference optimization, where annotators select the better of two responses to a prompt. While simple, this approach overlooks the opportunity to learn from richer forms of human feedback, such as multiwise comparisons and top-$k$ rankings. We propose Ranked Choice Preference Optimization (RCPO), a unified framework that bridges preference optimization with (ranked) choice modeling via maximum likelihood estimation. The framework is flexible, supporting both utility-based and rank-based choice models. It subsumes several existing pairwise methods (e.g., DPO, SimPO), while providing principled training objectives for richer feedback formats. We instantiate this framework with two representative ranked choice models (Multinomial Logit and Mallows-RMJ). Empirical studies on Llama-3-8B-Instruct and Gemma-2-9B-it across AlpacaEval 2 and Arena-Hard benchmarks show that RCPO consistently outperforms competitive baselines. RCPO shows how directly leveraging ranked preference data, combined with the right choice models, yields more effective alignment. It offers a versatile and extensible foundation for incorporating (ranked) choice modeling into LLM training.
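To make the maximum-likelihood framing concrete, here is a minimal sketch of the Multinomial Logit (Plackett-Luce) ranking likelihood, one of the two choice models the abstract names. This is an illustration, not the paper's exact objective: how RCPO computes the per-response utility scores (e.g., from policy log-probabilities, a reference model, or length normalization) is not specified in the abstract, so `scores` is treated as a given list of scalar utilities.

```python
import math

def plackett_luce_nll(scores):
    """Negative log-likelihood of an observed ranking under the
    Multinomial Logit / Plackett-Luce model.

    `scores` holds one utility per response, ordered from the
    top-ranked response down. The model picks the top remaining
    response at each stage with softmax probability over the
    responses not yet ranked; a top-k ranking simply truncates
    this product after k stages.
    """
    nll = 0.0
    for i in range(len(scores) - 1):  # the final choice is deterministic
        denom = sum(math.exp(s) for s in scores[i:])
        nll -= scores[i] - math.log(denom)
    return nll
```

With only two responses this reduces to the familiar pairwise logistic loss, `-log sigmoid(s_w - s_l)`, which is consistent with the claim that the framework subsumes pairwise methods such as DPO and SimPO under suitable score definitions.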