🤖 AI Summary
To jointly optimize response quality and length under multiple user preferences, this paper proposes REFA, a family of reference-free alignment methods. Methodologically, REFA combines three components: (1) deviation-based weighting that emphasizes high-quality responses within a candidate set; (2) length normalization to prevent trivial short-response solutions, coupled with an EOS-probability regularizer that mitigates dataset-induced brevity biases; and (3) a theoretical analysis under the Uncertainty Reduction with Sequence Length Assertion (URSLA), which shows that naive length normalization can still incentivize length-based shortcuts and that REFA corrects these subtle incentives. Evaluated on AlpacaEval v2, REFA achieves a Length-Controlled Win Rate (LC-WR) of 21.62% and an overall Win Rate (WR) of 19.87%, substantially outperforming InfoNCA and SimPO and setting a new state of the art among reference-free alignment methods.
📝 Abstract
We introduce REFA, a family of reference-free alignment methods that optimize over multiple user preferences while enforcing fine-grained length control. Our approach integrates deviation-based weighting to emphasize high-quality responses more strongly, length normalization to prevent trivial short-response solutions, and an EOS-probability regularizer to mitigate dataset-induced brevity biases. Theoretically, we show that under the Uncertainty Reduction with Sequence Length Assertion (URSLA), naive length normalization can still incentivize length-based shortcuts. By contrast, REFA corrects these subtle incentives, guiding models toward genuinely more informative and higher-quality outputs. Empirically, REFA sets a new state-of-the-art among reference-free alignment methods, producing richer responses aligned more closely with human preferences. Compared to a base supervised fine-tuned (SFT) Mistral-7B model that achieves an 8.4% length-controlled win rate (LC-WR) and 6.2% win rate (WR), our best REFA configuration attains 21.62% LC-WR and 19.87% WR on the AlpacaEval v2 benchmark. This represents a substantial improvement over both the strongest multi-preference baseline, InfoNCA (16.82% LC-WR, 10.44% WR), and the strongest reference-free baseline, SimPO (20.01% LC-WR, 17.65% WR).
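To make the abstract's ingredients concrete, here is a minimal, hedged sketch of how a REFA-style objective might combine deviation-based weighting, length normalization, and an EOS-probability regularizer. The function name, hyperparameters (`beta`, `gamma`), and the exact form of each term are illustrative assumptions, not the paper's actual loss:

```python
import math

def refa_style_loss(logps, lengths, scores, eos_probs, beta=2.0, gamma=0.1):
    """Illustrative sketch of a REFA-style multi-preference objective.

    logps     : total log-probability the model assigns to each candidate response
    lengths   : token length of each candidate response
    scores    : annotated quality scores for the candidates (higher = better)
    eos_probs : model's end-of-sequence probability at each response's final step
    beta      : temperature on the deviation-based weights (assumed hyperparameter)
    gamma     : strength of the EOS regularizer (assumed hyperparameter)
    """
    # Length-normalized log-likelihoods: dividing by length removes the
    # trivial advantage of short responses (every extra token lowers logp).
    norm = [lp / max(length, 1) for lp, length in zip(logps, lengths)]

    # Deviation-based weighting: a softmax over each score's deviation from
    # the set mean, so clearly better-than-average responses get more weight.
    mean_s = sum(scores) / len(scores)
    exps = [math.exp(beta * (s - mean_s)) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted negative log-likelihood over the candidate set.
    nll = -sum(w * n for w, n in zip(weights, norm))

    # EOS regularizer: mean log EOS probability grows (toward 0) as the model
    # becomes eager to stop, so adding it penalizes premature termination and
    # counteracts dataset-induced brevity bias.
    eos_pen = gamma * sum(math.log(p) for p in eos_probs) / len(eos_probs)

    return nll + eos_pen
```

Minimizing this toy loss pushes probability mass toward higher-scored responses (per token, not per sequence) while discouraging a high end-of-sequence probability; the paper's URSLA analysis argues that length normalization alone, without the EOS term, still leaves length-based shortcuts open.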