Back to Blackwell: Closing the Loop on Intransitivity in Multi-Objective Preference Fine-Tuning

📅 2026-02-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of intransitive (cyclic) preferences in multi-objective preference fine-tuning, which break the assumption, inherent in conventional approaches, that a globally optimal policy exists. The study presents the first extension of Blackwell optimality to multi-objective settings with intransitive preferences, introducing a game-theoretic solution concept termed the Maximum Entropy Blackwell Winner (MaxEntBW). Building on this foundation, the authors propose PROSPER, a provably efficient, scalarization-free preference fine-tuning algorithm that learns from multi-objective LLM-as-a-Judge feedback. Empirically, PROSPER outperforms all baselines considered on both instruction-following and general chat benchmarks, and the authors release trained model checkpoints at the 3B and 7B parameter scales.
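
To see why intransitivity breaks the notion of a single best response, consider a toy example (not from the paper; the objective names and rankings below are illustrative assumptions): aggregating per-objective rankings by majority vote can produce a Condorcet-style cycle.

```python
# Hypothetical toy example: aggregating rankings across multiple objectives
# can produce an intransitive (cyclic) preference over responses.
from itertools import combinations

# Rankings of three candidate responses under three illustrative objectives;
# earlier in the list means more preferred.
rankings = {
    "helpfulness": ["A", "B", "C"],
    "safety":      ["B", "C", "A"],
    "conciseness": ["C", "A", "B"],
}

def prefers(ranking, x, y):
    """True if x is ranked above y under a single objective."""
    return ranking.index(x) < ranking.index(y)

def majority_prefers(x, y):
    """Aggregate the objectives by majority vote (a crude scalarization)."""
    votes = sum(prefers(r, x, y) for r in rankings.values())
    return votes > len(rankings) / 2

for x, y in combinations("ABC", 2):
    winner, loser = (x, y) if majority_prefers(x, y) else (y, x)
    print(f"{winner} beats {loser}")
# Prints: A beats B / C beats A / B beats C -- a cycle, so no single
# response is optimal under the aggregated preference.
```

The downstream implication is the same whether the cycle comes from noisy single-objective judgments or from collapsing several objectives into one score: pairwise "wins" stop defining a maximum, which is the failure mode the paper's solution concept is built to handle.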

📝 Abstract
A recurring challenge in preference fine-tuning (PFT) is handling $\textit{intransitive}$ (i.e., cyclic) preferences. Intransitive preferences often stem from either $\textit{(i)}$ inconsistent rankings along a single objective or $\textit{(ii)}$ scalarizing multiple objectives into a single metric. Regardless of their source, the downstream implication of intransitive preferences is the same: there is no well-defined optimal policy, breaking a core assumption of the standard PFT pipeline. In response, we propose a novel, game-theoretic solution concept -- the $\textit{Maximum Entropy Blackwell Winner}$ ($\textit{MaxEntBW}$) -- that is well-defined under multi-objective intransitive preferences. To enable computing MaxEntBWs at scale, we derive $\texttt{PROSPER}$: a provably efficient PFT algorithm. Unlike prior self-play techniques, $\texttt{PROSPER}$ directly handles multiple objectives without requiring scalarization. We then apply $\texttt{PROSPER}$ to the problem of fine-tuning large language models (LLMs) from multi-objective LLM-as-a-Judge feedback (e.g., rubric-based judges), a setting where both sources of intransitivity arise. We find that $\texttt{PROSPER}$ outperforms all baselines considered across both instruction following and general chat benchmarks, releasing trained model checkpoints at the 7B and 3B parameter scales.
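
As rough intuition for the game-theoretic solution concept (a sketch under assumptions, not the paper's MaxEntBW definition or the PROSPER algorithm): treat pairwise preference probabilities as a symmetric zero-sum game and look for a mixed policy that cannot be beaten. The preference matrix, starting policy, and step size below are illustrative choices.

```python
# Toy sketch (not PROSPER): against an intransitive preference matrix,
# no pure response wins, but a mixed policy can be unbeatable. We look for
# it by self-play with an exponentiated-gradient (multiplicative-weights)
# update and track the time-averaged policy.
import numpy as np

# P[i][j] = probability response i is preferred to response j; this
# rock-paper-scissors structure is a deliberately cyclic assumption.
P = np.array([[0.5, 0.9, 0.1],
              [0.1, 0.5, 0.9],
              [0.9, 0.1, 0.5]])
M = P - P.T                      # antisymmetric payoff matrix of the game

pi = np.array([0.6, 0.3, 0.1])   # deliberately non-uniform starting policy
eta, T = 0.1, 5000               # step size and number of self-play rounds
avg = np.zeros(3)
for _ in range(T):
    payoff = M @ pi              # payoff of each pure response against pi
    pi = pi * np.exp(eta * payoff)
    pi /= pi.sum()
    avg += pi
print(avg / T)  # time-averaged policy, close to [1/3, 1/3, 1/3]
```

In this cyclic toy game the time-averaged self-play policy approaches the uniform mixture, i.e., the maximum-entropy equilibrium. The paper's contribution is a solution concept and algorithm that make this kind of well-defined mixed "winner" computable at LLM scale, across multiple objectives, without scalarizing them.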
Problem

Research questions and friction points this paper is trying to address.

intransitive preferences
multi-objective preference fine-tuning
preference fine-tuning
cyclic preferences
optimal policy
Innovation

Methods, ideas, or system contributions that make the work stand out.

intransitive preferences
multi-objective preference fine-tuning
Maximum Entropy Blackwell Winner
PROSPER
LLM-as-a-Judge