Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of standard reinforcement learning from human feedback (RLHF): it relies on transitive scalar rewards and therefore cannot represent the cyclic preferences that appear in human judgments. The authors propose the Hybrid Reward-Cyclic (HRC) model, which uses a game-theoretic decomposition to explicitly separate preferences into orthogonal transitive (scalar) and cyclic (vector) components. They also introduce Dynamic Self-Play Preference Optimization (DSPPO), an algorithm that treats alignment as a time-varying game and progressively guides the policy toward a Nash equilibrium. According to the authors, the explicit separation addresses a theoretical limitation of implicit models such as the General Preference Model (GPM), whose formulation entangles hierarchy with cyclicity and does not guarantee dominant solutions. On RewardBench 2, HRC improves over both Bradley-Terry (BT) and GPM baselines (e.g., +1.23% on Gemma-2B-it), and downstream evaluations on AlpacaEval 2.0, Arena-Hard-v0.1, and MT-Bench show HRC+DSPPO outperforming SPPO baselines trained with BT or GPM.
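
To make the transitive/cyclic split concrete, here is a minimal sketch of one possible hybrid score. It is an illustration only, not the HRC parameterization from the paper: the function names `hybrid_score` and `prefers`, the 2-D cross product used for the cyclic term, and the rock-paper-scissors items are all assumptions chosen for this example.

```python
import math

def hybrid_score(r_x, r_y, v_x, v_y):
    """Hypothetical hybrid preference score s(x, y) (an assumption, not the paper's form).

    Transitive part: difference of scalar rewards, r_x - r_y.
    Cyclic part: 2-D cross product of the item vectors, which is
    antisymmetric in (x, y) and can encode preference cycles.
    """
    transitive = r_x - r_y
    cyclic = v_x[0] * v_y[1] - v_x[1] * v_y[0]
    return transitive + cyclic

def pref_prob(score):
    # Logistic link: probability that x is preferred over y.
    return 1.0 / (1.0 + math.exp(-score))

# Rock-paper-scissors: identical scalar rewards, vectors 120 degrees apart.
ITEMS = {
    name: (0.0, (math.cos(angle), math.sin(angle)))
    for name, angle in [
        ("rock", 0.0),
        ("scissors", 2 * math.pi / 3),
        ("paper", 4 * math.pi / 3),
    ]
}

def prefers(x, y):
    (r_x, v_x), (r_y, v_y) = ITEMS[x], ITEMS[y]
    return pref_prob(hybrid_score(r_x, r_y, v_x, v_y))

if __name__ == "__main__":
    for x, y in [("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")]:
        print(f"P({x} > {y}) = {prefers(x, y):.3f}")
```

Because all three scalar rewards are equal, a reward-difference model alone would assign probability 0.5 to every pair here; the cycle is carried entirely by the vector term.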

📝 Abstract

Standard RLHF relies on transitive scalar rewards, failing to capture the cyclic nature of human preferences. While some approaches like the General Preference Model (GPM) address this, we identify a theoretical limitation: their implicit formulation entangles hierarchy with cyclicity, failing to guarantee dominant solutions. To address this, we propose the Hybrid Reward-Cyclic (HRC) model, which utilizes game-theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components. Complementing this, we introduce Dynamic Self-Play Preference Optimization (DSPPO), which treats alignment as a time-varying game to progressively guide the policy toward the Nash equilibrium. Synthetic data experiments further validate HRC's structural superiority in mixed transitive-cyclic settings, where HRC converges faster and achieves higher accuracy than GPM. Experiments on RewardBench 2 demonstrate that HRC consistently improves over both BT and GPM baselines (e.g., +1.23% on Gemma-2B-it). In particular, its superior performance in the Ties domain empirically validates the model's robustness in handling complex, non-strict preferences. Extensive downstream evaluations on AlpacaEval 2.0, Arena-Hard-v0.1, and MT-Bench confirm the efficacy of our framework. Notably, when using Gemma-2B-it as the base preference model, HRC+DSPPO achieves a peak length-controlled win-rate of 44.75% on AlpacaEval 2.0 and 46.8% on Arena-Hard-v0.1, significantly outperforming SPPO baselines trained with BT or GPM. Our code is publicly available at https://github.com/lab-klc/Hybrid-Reward-Cyclic.
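
The idea of steering a policy toward a Nash equilibrium through self-play can be illustrated on a tiny fixed preference game. The sketch below is generic multiplicative-weights self-play, not the DSPPO algorithm: the game is static rather than time-varying, and the matrix `P`, the step size `eta`, and the function names are assumptions made for this example.

```python
import math

def self_play_mwu(P, steps=2000, eta=0.1):
    """Generic multiplicative-weights self-play (illustrative, not DSPPO).

    P[i][j] is the probability that response i is preferred over response j,
    with P[i][j] + P[j][i] == 1. Returns the time-averaged policy, which
    approaches a Nash equilibrium of the symmetric preference game.
    """
    n = len(P)
    policy = [1.0 / n] * n
    avg = [0.0] * n
    for _ in range(steps):
        # Accumulate the running average of the policies played so far.
        for i in range(n):
            avg[i] += policy[i] / steps
        # Expected win rate of each response against the current policy.
        win = [sum(P[i][j] * policy[j] for j in range(n)) for i in range(n)]
        # Exponentially up-weight responses that beat the current policy.
        weights = [policy[i] * math.exp(eta * win[i]) for i in range(n)]
        total = sum(weights)
        policy = [w / total for w in weights]
    return avg

def exploitability(P, policy):
    # Best achievable win rate against `policy`, minus the 0.5 value of the game.
    n = len(P)
    return max(sum(P[i][j] * policy[j] for j in range(n)) for i in range(n)) - 0.5

# A cyclic preference matrix with unequal margins (hypothetical numbers).
P = [
    [0.5, 0.9, 0.2],
    [0.1, 0.5, 0.7],
    [0.8, 0.3, 0.5],
]

if __name__ == "__main__":
    avg_policy = self_play_mwu(P)
    print("average policy:", [round(p, 3) for p in avg_policy])
    print("exploitability:", round(exploitability(P, avg_policy), 4))
```

The time-averaged policy is returned instead of the last iterate because multiplicative-weights iterates can keep orbiting the equilibrium in cyclic games, while their average has a standard no-regret guarantee.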

Problem

Research questions and friction points this paper is trying to address.

transitivity

cyclicity

preference modeling

RLHF

human preferences

Innovation

Methods, ideas, or system contributions that make the work stand out.

preference decomposition

transitivity and cyclicity

game-theoretic modeling