🤖 AI Summary
The superiority of online alignment methods (e.g., GRPO) over offline ones (e.g., DPO) remains poorly understood. Grounded in prospect theory from behavioral economics, this work posits that the core of online alignment lies in modeling human probability-perception bias, i.e., humans' subjectively distorted assessment of how likely the model is to produce a given output. It explicitly incorporates such perceptual biases into the alignment objective for the first time, moves beyond the conventional online/offline training dichotomy, and proposes a general *humanline* design framework. This framework unifies mainstream objectives, including DPO, KTO, and GRPO, by augmenting them with a perceptual loss term that better approximates how humans perceive the model's output distribution. Experiments demonstrate that *humanline* variants, trained fully offline at low cost, match the performance of online methods on both verifiable and unverifiable tasks.
📝 Abstract
Online alignment (e.g., GRPO) is generally more performant than offline alignment (e.g., DPO) -- but why? Drawing on prospect theory from behavioral economics, we propose a human-centric explanation. We prove that online on-policy sampling better approximates the human-perceived distribution of what the model can produce, and that PPO/GRPO-style clipping -- originally introduced merely to stabilize training -- recovers a perceptual bias in how humans perceive probability. In this sense, PPO/GRPO already act as perceptual losses. Our theory further suggests that the online/offline dichotomy is itself incidental to maximizing human utility: we can achieve the same effect by selectively training on any data in a manner that mimics human perception, rather than restricting ourselves to online on-policy data. Doing so allows us to post-train more quickly, cheaply, and flexibly without sacrificing performance. To this end, we propose a design pattern that explicitly incorporates perceptual distortions of probability into objectives like DPO/KTO/GRPO, creating humanline variants of them. Surprisingly, we find that these humanline variants, even when trained with offline off-policy data, can match the performance of their online counterparts on both verifiable and unverifiable tasks.
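To make the connection between clipping and perceptual distortion concrete, below is a minimal sketch (not the paper's exact humanline objective) of the two ingredients the abstract names: the Tversky-Kahneman probability weighting function from prospect theory, which overweights small probabilities and underweights large ones, and the standard PPO/GRPO clipped surrogate, which is insensitive to probability-ratio changes outside a narrow band. The function names and the `gamma`/`eps` defaults here are illustrative assumptions, not values from the paper.

```python
def kt_weight(p: float, gamma: float = 0.61) -> float:
    """Tversky-Kahneman probability weighting function w(p).

    For gamma < 1, w(p) > p when p is small and w(p) < p when p is large,
    capturing the human bias of overweighting rare outcomes. gamma = 1
    recovers the undistorted identity w(p) = p.
    """
    return p**gamma / (p**gamma + (1.0 - p) ** gamma) ** (1.0 / gamma)


def ppo_clip_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO/GRPO clipped surrogate for one token/sequence (to be maximized).

    ratio = pi_theta(y|x) / pi_old(y|x). Clipping flattens the objective
    outside [1 - eps, 1 + eps], so the loss stops responding to further
    probability-ratio changes there -- loosely analogous to a perceptual
    distortion that compresses extreme probabilities.
    """
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)
```

For example, `kt_weight(0.01)` exceeds 0.01 (a rare completion "feels" more likely than it is), while `ppo_clip_term(1.5, 1.0)` caps the surrogate at `1.2 * advantage` rather than tracking the full ratio.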