Stackelberg Learning from Human Feedback: Preference Optimization as a Sequential Game

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key limitations in preference optimization: RLHF’s inability to model intransitive preferences and NLHF’s reliance on a restrictive simultaneous-move equilibrium assumption. To overcome these, we propose Stackelberg Learning from Human Feedback (SLHF), a novel preference alignment framework grounded in sequential game theory. SLHF formalizes human feedback as a Stackelberg game in which humans act as leaders committing to actions, and the model serves as a follower responding conditionally—thereby decoupling commitment from response and enabling iterative, inference-time refinement. Crucially, SLHF bypasses explicit reward modeling and reinforcement learning, relying solely on conditional policy learning and equilibrium computation. Experiments across 0.5B–8B parameter language models demonstrate strong cross-dataset alignment, significantly improved robustness to intransitive preferences, enhanced data sensitivity in few-shot settings, and zero-shot transfer of its inference-time refinement capability across distinct model families.

📝 Abstract
We introduce Stackelberg Learning from Human Feedback (SLHF), a new framework for preference optimization. SLHF frames the alignment problem as a sequential-move game between two policies: a Leader, which commits to an action, and a Follower, which responds conditionally on the Leader's action. This approach decomposes preference optimization into a refinement problem for the Follower and an optimization problem against an adversary for the Leader. Unlike Reinforcement Learning from Human Feedback (RLHF), which assigns scalar rewards to actions, or Nash Learning from Human Feedback (NLHF), which seeks a simultaneous-move equilibrium, SLHF leverages the asymmetry of sequential play to capture richer preference structures. The sequential design of SLHF naturally enables inference-time refinement, as the Follower learns to improve the Leader's actions, and these refinements can be leveraged through iterative sampling. We compare the solution concepts of SLHF, RLHF, and NLHF, and lay out key advantages in consistency, data sensitivity, and robustness to intransitive preferences. Experiments on large language models demonstrate that SLHF achieves strong alignment across diverse preference datasets, scales from 0.5B to 8B parameters, and yields inference-time refinements that transfer across model families without further fine-tuning.
Problem

Research questions and friction points this paper is trying to address.

Optimizes alignment through a sequential game between Leader and Follower policies
Captures richer preference structures via the asymmetry of sequential play
Enables inference-time refinement that transfers across models without additional fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential game framework for preference optimization
Decomposes alignment into leader-follower refinement tasks
Enables inference-time refinement without additional fine-tuning
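The iterative inference-time refinement described above can be sketched as a simple loop: the Leader commits to an initial action, and the Follower repeatedly proposes conditional refinements, each kept only if preferred to the incumbent. This is a minimal illustration, not the paper's implementation; `leader_sample`, `follower_refine`, and `preference` are hypothetical stand-ins for the trained policies and the pairwise preference judgment.

```python
def leader_sample(prompt: str) -> str:
    # Hypothetical stand-in: the Leader policy commits to an initial response.
    return f"draft answer to: {prompt}"

def follower_refine(prompt: str, action: str) -> str:
    # Hypothetical stand-in: the Follower conditions on the Leader's
    # committed action and proposes a refinement of it.
    return action + " [refined]"

def preference(prompt: str, a: str, b: str) -> bool:
    # Hypothetical stand-in for a pairwise preference judgment;
    # here we naively prefer the longer (more refined) candidate.
    return len(a) >= len(b)

def stackelberg_inference(prompt: str, n_rounds: int = 3) -> str:
    """Iterative inference-time refinement: keep refining the current
    best action, accepting a candidate only if it is preferred."""
    best = leader_sample(prompt)
    for _ in range(n_rounds):
        candidate = follower_refine(prompt, best)
        if preference(prompt, candidate, best):
            best = candidate
    return best
```

Because the Follower only ever conditions on a committed action, the same loop can in principle be applied to actions drawn from a different base model, which is the intuition behind the cross-model transfer the paper reports.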
🔎 Similar Papers
2024-06-24 · Neural Information Processing Systems · Citations: 5