Learning to Align Human Code Preferences

📅 2025-07-26
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the challenge of aligning large language models (LLMs) with diverse human code preferences, where conventional methods struggle to generalize across heterogeneous preference structures. We propose Adaptive Preference Optimization (APO), a novel framework that dynamically balances supervision from supervised fine-tuning (SFT) and preference signals from direct preference optimization (DPO), conditioned on task-specific characteristics. APO jointly models three complementary objectives: amplifying preferred responses, suppressing dispreferred ones, and encouraging exploratory solution generation. Theoretically, we characterize the complementary regimes of SFT and DPO, identifying their respective optimality conditions under varying preference structures, including locally dense optima and long-tailed distributions. Empirically, we evaluate APO across six representative code preference tasks; it consistently matches or outperforms SFT, DPO, and the S&D (SFT followed by DPO) baseline, demonstrating significantly improved generalization and alignment fidelity for complex, heterogeneous code preferences.

๐Ÿ“ Abstract
Large Language Models (LLMs) have demonstrated remarkable potential in automating software development tasks. While recent advances leverage Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align models with human preferences, the optimal training strategy remains unclear across diverse code preference scenarios. This paper systematically investigates the roles of SFT and DPO in aligning LLMs with different code preferences. Through both theoretical analysis and empirical observation, we hypothesize that SFT excels in scenarios with objectively verifiable optimal solutions, while applying SFT followed by DPO (S&D) enables models to explore superior solutions in scenarios without objectively verifiable optimal solutions. Based on the analysis and experimental evidence, we propose Adaptive Preference Optimization (APO), a dynamic integration approach that adaptively amplifies preferred responses, suppresses dispreferred ones, and encourages exploration of potentially superior solutions during training. Extensive experiments across six representative code preference tasks validate our theoretical hypotheses and demonstrate that APO consistently matches or surpasses the performance of existing SFT and S&D strategies. Our work provides both theoretical foundations and practical guidance for selecting appropriate training strategies in different code preference alignment scenarios.
Problem

Research questions and friction points this paper is trying to address.

Optimizing training strategies for aligning LLMs with human code preferences
Comparing SFT and DPO effectiveness in diverse code preference scenarios
Proposing Adaptive Preference Optimization for dynamic response adjustment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines SFT and DPO for code alignment
Proposes Adaptive Preference Optimization (APO)
Dynamically adjusts training for superior solutions
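The Innovation points describe APO as a dynamic blend of SFT and DPO signals. A minimal sketch of one plausible form of such a combined objective follows; the function name `apo_loss`, the mixing weight `lam`, and all numeric inputs are illustrative assumptions, not the paper's actual formulation:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def apo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1, lam: float = 0.5) -> float:
    """Hypothetical APO-style objective on a single preference pair.

    logp_w / logp_l: policy log-probs of the preferred / dispreferred response.
    ref_logp_w / ref_logp_l: the same log-probs under the frozen reference model.
    lam=1.0 recovers pure SFT; lam=0.0 recovers the standard DPO loss.
    """
    # SFT term: amplify the preferred response directly.
    sft = -logp_w
    # DPO term: widen the reward margin between preferred and dispreferred,
    # measured relative to the reference model (Rafailov et al.'s DPO form).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo = -math.log(sigmoid(margin))
    # A task-dependent lam would adapt the balance per preference scenario.
    return lam * sft + (1.0 - lam) * dpo
```

In this sketch, scenarios with objectively verifiable optimal solutions would push `lam` toward 1 (SFT-dominant), while open-ended scenarios would push it toward 0 so the DPO margin term can drive exploration of superior solutions.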