🤖 AI Summary
This paper addresses the multi-objective fine-tuning (MOFT) problem by proposing a novel framework that jointly optimizes heterogeneous objectives—such as learning-to-rank (LTR) and large language model (LLM) alignment—in a single training pass. Methodologically, it introduces (1) a conditional one-shot Direct Preference Optimization (DPO) algorithm grounded in the Plackett–Luce model, unifying multi-objective preference modeling as ordered choice probability estimation; and (2) hyper-prompt tuning coupled with a temperature-conditioned network, enabling continuous, differentiable control of objective weights and post-training trade-off adjustment without architectural modification. Experiments demonstrate that the method efficiently constructs the Pareto frontier for both LTR and LLM alignment tasks. Crucially, it supports flexible, deployment-time objective trade-offs using only a single training run—substantially improving computational efficiency and engineering practicality.
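To make the one-shot idea concrete, here is a minimal schematic sketch (not the paper's implementation; all names are illustrative): each step samples an objective-weight vector from the simplex, and that same vector would both condition the model (e.g. via a hyper-prompt) and scalarize the per-objective losses, so a single training run covers the whole trade-off family.

```python
import random

def sample_simplex_weights(k):
    """Sample objective weights uniformly from the probability simplex
    (normalized exponential draws, i.e. Dirichlet(1, ..., 1))."""
    draws = [random.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def scalarized_loss(per_objective_losses, weights):
    """Convex combination of the per-objective losses under the sampled
    weights. In the conditioned scheme, the same weights would also be
    fed to the model as a conditioning input; only the scalarization
    step is shown here."""
    return sum(w * l for w, l in zip(weights, per_objective_losses))
```

At deployment time, the trade-off is adjusted by changing the conditioning weights alone, with no retraining.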
📝 Abstract
In LLM alignment and many other ML applications, one often faces the Multi-Objective Fine-Tuning (MOFT) problem, i.e., fine-tuning an existing model with datasets labeled w.r.t. different objectives simultaneously. To address this challenge, we propose the HyperDPO framework, a conditioned one-shot fine-tuning approach that extends the Direct Preference Optimization (DPO) technique, originally developed for efficient LLM alignment with preference data, to accommodate MOFT settings. By substituting the Bradley-Terry-Luce model in DPO with the Plackett-Luce model, our framework can handle a wide range of MOFT tasks involving listwise ranking datasets. Compared with previous approaches, HyperDPO enjoys an efficient one-shot training process for profiling the Pareto front of auxiliary objectives, and offers post-training control over trade-offs. Additionally, we propose a novel Hyper Prompt Tuning design that conveys continuous importance weights across objectives to transformer-based models without altering their architecture, and we investigate the potential of temperature-conditioned networks for enhancing the flexibility of post-training control. We demonstrate the effectiveness and efficiency of the HyperDPO framework through its applications to various tasks, including Learning-to-Rank (LTR) and LLM alignment, highlighting its viability for large-scale ML deployments.
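The Plackett-Luce substitution mentioned above can be illustrated with a short, self-contained sketch (the function name and plain-Python form are ours, not the paper's): the log-probability of an observed ranking is a product of sequential softmax choices over the items not yet ranked, and with exactly two items it reduces to the Bradley-Terry pairwise form used in standard DPO.

```python
import math

def plackett_luce_log_likelihood(scores):
    """Log-probability of the observed ordering under the Plackett-Luce
    model. `scores` are item logits listed in observed rank order (best
    first); each factor is a softmax of the chosen item against all
    items not yet chosen."""
    total = 0.0
    for i in range(len(scores)):
        remaining = scores[i:]
        m = max(remaining)  # shift for a numerically stable logsumexp
        lse = m + math.log(sum(math.exp(s - m) for s in remaining))
        total += scores[i] - lse
    return total
```

For a two-item list this equals `log(sigmoid(s1 - s2))`, the familiar pairwise preference likelihood, which is what makes the listwise extension a strict generalization.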