Convex Optimization for Alignment and Preference Learning on a Single GPU

📅 2026-05-22
🤖 AI Summary
Preference alignment training for large language models is typically computationally expensive, relies on multi-GPU setups, and requires extensive hyperparameter tuning. This work proposes COALA, an algorithm that uses convex reformulations of neural networks to enable lightweight, reference-model-free preference learning. The authors describe it as, to the best of their knowledge, the first effective application of convex optimization to preference fine-tuning. The method comes with strong theoretical guarantees and reduces resource requirements enough to allow efficient training on a single GPU. Experiments across four datasets and six models show that COALA achieves competitive performance while using as little as roughly 17.6% of the total TFLOPs required by Direct Preference Optimization (DPO), with stable, monotonically increasing rewards and faster convergence.
📝 Abstract
Fine-tuning large language models (LLMs) to align with human preferences has driven the success of systems such as Gemini and ChatGPT. However, approaches like Reinforcement Learning from Human Feedback (RLHF) remain computationally expensive and complex. Direct Preference Optimization (DPO) offers a simpler alternative but has limitations such as inconsistent ranking accuracy, heavy dependence on GPU resources, and expensive hyperparameter tuning. We propose the Convex Optimization for Alignment and Preference Learning Algorithm (COALA), a novel lightweight strategy with strong theoretical guarantees. By leveraging the convex optimization reformulation of neural networks, COALA eliminates the need for a reference model and achieves significant reductions in both training time and VRAM consumption, thus enabling efficient training on a single GPU. Experiments across four datasets (including a 26,621-sample synthetic Educational Feedback dataset) and six models (including Llama-3.1-8B) demonstrate COALA's competitive performance and efficiency while using as little as ~17.6% of DPO's total TFLOPs. COALA exhibits stable, monotonically increasing rewards and reaches peak margins in significantly less time than traditional methods such as DPO and ORPO. To the best of our knowledge, this is the first time convex optimization has been effectively applied to preference fine-tuning of LLMs.
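For context on what "eliminates the need for a reference model" refers to, the sketch below shows the standard DPO loss for a single preference pair. It is the DPO baseline only, not COALA's objective, which this page does not spell out. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions; the inputs are summed log-probabilities of the chosen and rejected responses under the trained policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative, not COALA)."""
    # Log-ratio of the policy to the frozen reference for each response.
    # These two reference terms are why DPO needs a second model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Implicit reward margin between the chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin); this form is stable
    # for large positive or negative margins.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy matches the reference, the margin is zero and the loss is log 2.
loss_at_init = dpo_loss(-1.0, -2.0, -1.0, -2.0)
```

The two `ref_*` inputs require extra forward passes through a second copy of the model (or precomputed values) for every training pair, which is one source of the memory and compute overhead that a reference-free method avoids.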
Problem

Research questions and friction points this paper is trying to address.

Preference Learning
Large Language Models
Alignment
Computational Efficiency
GPU Resource
Innovation

Methods, ideas, or system contributions that make the work stand out.

Convex Optimization
Preference Learning
LLM Alignment
Single-GPU Training
COALA