Learning to Align Human Code Preferences

📅 2025-07-26
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the challenge of aligning large language models (LLMs) with diverse human code preferences, where conventional methods struggle to generalize across heterogeneous preference structures. We propose Adaptive Preference Optimization (APO), a novel framework that dynamically balances supervision from supervised fine-tuning (SFT) and preference signals from direct preference optimization (DPO), conditioned on task-specific characteristics. APO jointly models three complementary objectives: amplifying preferred responses, suppressing dispreferred ones, and encouraging exploratory solution generation. Theoretically, we characterize the complementary regimes of SFT and DPO, identifying their respective optimality conditions under varying preference structures, including locally dense optima and long-tailed distributions. Empirically, we evaluate APO across six representative code preference tasks; it consistently matches or outperforms SFT, DPO, and the S&D (SFT followed by DPO) baseline, demonstrating significantly improved generalization and alignment fidelity for complex, heterogeneous code preferences.

๐Ÿ“ Abstract
Large Language Models (LLMs) have demonstrated remarkable potential in automating software development tasks. While recent advances leverage Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align models with human preferences, the optimal training strategy remains unclear across diverse code preference scenarios. This paper systematically investigates the roles of SFT and DPO in aligning LLMs with different code preferences. Through both theoretical analysis and empirical observation, we hypothesize that SFT excels in scenarios with objectively verifiable optimal solutions, while applying SFT followed by DPO (S&D) enables models to explore superior solutions in scenarios without objectively verifiable optimal solutions. Based on the analysis and experimental evidence, we propose Adaptive Preference Optimization (APO), a dynamic integration approach that adaptively amplifies preferred responses, suppresses dispreferred ones, and encourages exploration of potentially superior solutions during training. Extensive experiments across six representative code preference tasks validate our theoretical hypotheses and demonstrate that APO consistently matches or surpasses the performance of existing SFT and S&D strategies. Our work provides both theoretical foundations and practical guidance for selecting appropriate training strategies in different code preference alignment scenarios.
Problem

Research questions and friction points this paper is trying to address.

Optimizing training strategies for aligning LLMs with human code preferences
Comparing SFT and DPO effectiveness in diverse code preference scenarios
Proposing Adaptive Preference Optimization for dynamic response adjustment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines SFT and DPO for code alignment
Proposes Adaptive Preference Optimization (APO)
Dynamically adjusts training for superior solutions
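The Innovation points describe APO as a dynamic blend of SFT and DPO signals. A minimal sketch of one plausible form of such a combined objective follows; the function name `apo_loss`, the mixing weight `lam`, and all numeric inputs are illustrative assumptions, not the paper's actual formulation:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def apo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1, lam: float = 0.5) -> float:
    """Hypothetical APO-style objective on a single preference pair.

    logp_w / logp_l: policy log-probs of the preferred / dispreferred response.
    ref_logp_w / ref_logp_l: the same log-probs under the frozen reference model.
    lam=1.0 recovers pure SFT; lam=0.0 recovers the standard DPO loss.
    """
    # SFT term: amplify the preferred response directly.
    sft = -logp_w
    # DPO term: widen the reward margin between preferred and dispreferred,
    # measured relative to the reference model (Rafailov et al.'s DPO form).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo = -math.log(sigmoid(margin))
    # A task-dependent lam would adapt the balance per preference scenario.
    return lam * sft + (1.0 - lam) * dpo
```

In this sketch, scenarios with objectively verifiable optimal solutions would push `lam` toward 1 (SFT-dominant), while open-ended scenarios would push it toward 0 so the DPO margin term can drive exploration of superior solutions.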