🤖 AI Summary
In real-world deployments, users typically provide only sparse binary feedback (e.g., "like/dislike"), whereas mainstream alignment methods such as DPO rely on costly pairwise preference annotations. This work develops an efficient alignment paradigm that requires only such binary signals, and provides its theoretical foundation: optimizing a binary classifier whose logit is a reward implicitly minimizes the DPO loss. Technically, it introduces two key mechanisms, reward shift and underlying distribution matching, which together combine binary classification modeling, logit-level implicit preference optimization, and distributional regularization. Experiments show that the method matches DPO and KTO on a standard pairwise preference benchmark, and achieves robust alignment across two base LLMs on three diverse binary-feedback datasets that simulate real-world conditions, making it well suited to low-cost, resource-constrained settings.
📝 Abstract
Aligning Large Language Models (LLMs) to human preferences through preference optimization has been crucial but labor-intensive, since it requires evaluators to compare a chosen and a rejected text completion for each prompt. Recently, Kahneman-Tversky Optimization (KTO) has demonstrated that LLMs can be aligned using merely binary "thumbs-up" or "thumbs-down" signals on each prompt-completion pair. In this paper, we present theoretical foundations that explain the successful alignment achieved through these binary signals. Our analysis uncovers a new perspective: optimizing a binary classifier, whose logit is a reward, implicitly minimizes the Direct Preference Optimization (DPO) loss. In the process of this discovery, we identify two techniques for effective alignment: reward shift and underlying distribution matching. Consequently, we propose a new algorithm, *Binary Classifier Optimization*, that integrates both techniques. We validate our methodology in two settings: first, on a paired preference dataset, where our method performs on par with DPO and KTO; and second, on binary signal datasets simulating real-world conditions with divergent underlying distributions between thumbs-up and thumbs-down data. Our model consistently demonstrates effective and robust alignment across two base LLMs and three different binary signal datasets, showcasing the strength of our approach to learning from binary feedback.
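To make the binary-classifier view concrete, here is a minimal toy sketch, not the paper's implementation: it assumes the implicit reward is the beta-scaled log-likelihood ratio between policy and reference (as in DPO-style methods), treats that reward as a classifier logit, and applies a logistic loss after subtracting a reward shift `delta` (illustrated here as a batch mean; the function names, the fixed `beta`, and this choice of shift are all illustrative assumptions, and distribution matching is omitted):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    # Implicit reward: scaled log-ratio of policy to reference likelihood,
    # as in DPO-style objectives (beta value is illustrative).
    return beta * (logp_policy - logp_ref)

def bco_loss(rewards, labels, delta):
    # Binary-classification loss on shifted rewards.
    # labels: 1 for thumbs-up, 0 for thumbs-down.
    # delta: reward shift (here, e.g., a batch mean of rewards).
    losses = []
    for r, y in zip(rewards, labels):
        margin = r - delta
        if y == 1:
            losses.append(-math.log(sigmoid(margin)))   # push reward above the shift
        else:
            losses.append(-math.log(sigmoid(-margin)))  # push reward below the shift
    return sum(losses) / len(losses)
```

Minimizing this loss pushes thumbs-up completions toward higher implicit reward than the shift and thumbs-down completions toward lower, which is the sense in which classifier optimization induces preference optimization.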