References Improve LLM Alignment in Non-Verifiable Domains

📅 2026-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of aligning large language models (LLMs) in settings where ground-truth verifiers are unavailable, rendering conventional reinforcement learning approaches inapplicable. We propose leveraging high-quality reference outputs to guide an LLM acting as a soft verifier, thereby enhancing its judgment accuracy and enabling self-improvement through iterative training. To our knowledge, this is the first systematic demonstration that reference outputs substantially improve an LLM’s evaluative capability and effectively support unsupervised alignment fine-tuning. Evaluated on AlpacaEval and Arena-Hard, our method achieves scores of 73.1%/58.7% with Llama-3-8B-Instruct and 70.0%/74.1% with Qwen2.5-7B, respectively—improving by 20.2 and 17.1 points on average over supervised fine-tuning distillation and significantly outperforming reference-free self-improvement baselines.

📝 Abstract
While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, such as LLM alignment. In this work, we investigate whether reference-guided LLM-evaluators can bridge this gap by serving as soft "verifiers". First, we design evaluation protocols that enhance LLM-based evaluators for LLM alignment using reference outputs. Through comprehensive experiments, we show that a reference-guided approach substantially improves the accuracy of less capable LLM-judges using references from frontier models; stronger LLM-judges can also be enhanced by high-quality (i.e., human-written) references. Building on these improved judges, we demonstrate the utility of high-quality references in alignment tuning, where LLMs guided with references are used as judges to self-improve. We show that reference-guided self-improvement yields clear gains over both direct SFT on reference outputs and self-improvement with reference-free judges, achieving performance comparable to training with ArmoRM, a strong finetuned reward model. Specifically, our method achieves 73.1% and 58.7% on AlpacaEval and Arena-Hard with Llama-3-8B-Instruct, and 70.0% and 74.1% with Qwen2.5-7B, corresponding to average absolute gains of +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement on AlpacaEval / Arena-Hard. These results highlight the potential of using reference-guided LLM-evaluators to enable effective LLM post-training in non-verifiable domains.
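The pipeline described in the abstract can be sketched in a few lines: sample several candidate outputs, score each against a high-quality reference using a judge, and keep the best/worst pair for preference tuning. The sketch below is a minimal illustration, not the paper's implementation; the actual method uses an LLM as the reference-guided judge, which is replaced here by a hypothetical lexical-overlap scorer (`judge_with_reference`) so the example runs standalone.

```python
def judge_with_reference(candidate: str, reference: str) -> float:
    """Stand-in for an LLM judge: score a candidate against a
    high-quality reference via token-set (Jaccard) overlap.
    The paper instead prompts an LLM-evaluator with the reference."""
    c = set(candidate.lower().split())
    r = set(reference.lower().split())
    return len(c & r) / len(c | r) if c | r else 0.0


def build_preference_pair(prompt: str, candidates: list[str], reference: str) -> dict:
    """Rank sampled model outputs with the reference-guided judge and
    return a (chosen, rejected) pair for preference tuning (e.g. DPO)."""
    ranked = sorted(
        candidates,
        key=lambda cand: judge_with_reference(cand, reference),
        reverse=True,
    )
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}
```

In the self-improvement loop, such pairs would be collected over a prompt set and fed back into preference optimization of the same model, with references drawn from a frontier model or human-written outputs.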
Problem

Research questions and friction points this paper is trying to address.

LLM alignment
non-verifiable domains
reference-guided evaluation
reward modeling
self-improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

reference-guided evaluation
LLM alignment
non-verifiable domains
self-improvement
soft verifiers