🤖 AI Summary
To jointly optimize response quality and length under multiple user preferences, this paper proposes REFA, a family of reference-free alignment methods. Methodologically, REFA combines three components: (1) deviation-based weighting that emphasizes high-quality responses within a candidate set; (2) length normalization to prevent trivial short-response solutions, coupled with an EOS-probability regularizer that mitigates dataset-induced brevity biases; and (3) a theoretical analysis under the Uncertainty Reduction with Sequence Length Assertion (URSLA), which shows that naive length normalization can still incentivize length-based shortcuts and that REFA corrects these subtle incentives. Evaluated on AlpacaEval v2, REFA achieves a Length-Controlled Win Rate (LC-WR) of 21.62% and an overall Win Rate (WR) of 19.87%, substantially outperforming InfoNCA and SimPO and setting a new state of the art among reference-free alignment methods.
📝 Abstract
We introduce REFA, a family of reference-free alignment methods that optimize over multiple user preferences while enforcing fine-grained length control. Our approach integrates deviation-based weighting to emphasize high-quality responses more strongly, length normalization to prevent trivial short-response solutions, and an EOS-probability regularizer to mitigate dataset-induced brevity biases. Theoretically, we show that under the Uncertainty Reduction with Sequence Length Assertion (URSLA), naive length normalization can still incentivize length-based shortcuts. By contrast, REFA corrects these subtle incentives, guiding models toward genuinely more informative and higher-quality outputs. Empirically, REFA sets a new state-of-the-art among reference-free alignment methods, producing richer responses aligned more closely with human preferences. Compared to a base supervised fine-tuned (SFT) Mistral-7B model that achieves an 8.4% length-controlled win rate (LC-WR) and 6.2% win rate (WR), our best REFA configuration attains 21.62% LC-WR and 19.87% WR on the AlpacaEval v2 benchmark. This represents a substantial improvement over both the strongest multi-preference baseline, InfoNCA (16.82% LC-WR, 10.44% WR), and the strongest reference-free baseline, SimPO (20.01% LC-WR, 17.65% WR).
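To make the abstract's ingredients concrete, here is a minimal, hedged sketch of how a REFA-style objective might combine deviation-based weighting, length normalization, and an EOS-probability regularizer. The function name, hyperparameters (`beta`, `gamma`), and the exact form of each term are illustrative assumptions, not the paper's actual loss:

```python
import math

def refa_style_loss(logps, lengths, scores, eos_probs, beta=2.0, gamma=0.1):
    """Illustrative sketch of a REFA-style multi-preference objective.

    logps     : total log-probability the model assigns to each candidate response
    lengths   : token length of each candidate response
    scores    : annotated quality scores for the candidates (higher = better)
    eos_probs : model's end-of-sequence probability at each response's final step
    beta      : temperature on the deviation-based weights (assumed hyperparameter)
    gamma     : strength of the EOS regularizer (assumed hyperparameter)
    """
    # Length-normalized log-likelihoods: dividing by length removes the
    # trivial advantage of short responses (every extra token lowers logp).
    norm = [lp / max(length, 1) for lp, length in zip(logps, lengths)]

    # Deviation-based weighting: a softmax over each score's deviation from
    # the set mean, so clearly better-than-average responses get more weight.
    mean_s = sum(scores) / len(scores)
    exps = [math.exp(beta * (s - mean_s)) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted negative log-likelihood over the candidate set.
    nll = -sum(w * n for w, n in zip(weights, norm))

    # EOS regularizer: mean log EOS probability grows (toward 0) as the model
    # becomes eager to stop, so adding it penalizes premature termination and
    # counteracts dataset-induced brevity bias.
    eos_pen = gamma * sum(math.log(p) for p in eos_probs) / len(eos_probs)

    return nll + eos_pen
```

Minimizing this toy loss pushes probability mass toward higher-scored responses (per token, not per sequence) while discouraging a high end-of-sequence probability; the paper's URSLA analysis argues that length normalization alone, without the EOS term, still leaves length-based shortcuts open.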