REAL: Response Embedding-based Alignment for LLMs

📅 2024-09-17

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

career value

195K/year

🤖 AI Summary

To address the high cost, significant bias, and high error rate associated with human-annotated response pairs in LLM alignment, this paper proposes a novel response-pair selection method based on disentangled response embedding similarity. Instead of relying on traditional prompt-relevance criteria, our approach explicitly models the intrinsic semantic divergence among responses. It leverages cosine similarity–based clustering and SHAP-inspired uncertainty sampling to automatically identify the most informative response pairs for preference annotation. Crucially, this is the first method to enable similarity measurement that disentangles response representations from prompt influence, thereby improving both annotation fidelity and alignment efficiency. Evaluations on the SHP2 and HH-RLHF benchmarks demonstrate an 8.2% increase in model win rate, a 12.5% improvement in preference margin, a 65% reduction in annotation effort, and a 37% decrease in annotation error rate.

Technology Category

Application Category

📝 Abstract

Aligning large language models (LLMs) to human preferences is a crucial step in building helpful and safe AI tools, which usually involve training on supervised datasets. Popular algorithms such as Direct Preference Optimization rely on pairs of AI-generated responses ranked according to human feedback. The response pair annotation process is the most labor-intensive and costly part of the alignment pipeline, and improving its efficiency and annotation quality would have a meaningful impact on AI development. We propose REAL: Response Embedding-based Alignment for LLMs, a strategy for constructing a high-quality training dataset that focuses on acquiring the most informative response pairs for labeling out of a set of response candidates. Our selection process is based on embedding responses independently of prompts. Experimental results on real-world dataset SHP2 and synthetic HH-RLHF benchmarks indicate that choosing dissimilar response pairs enhances the direct alignment of LLMs while reducing inherited labeling errors. The model aligned on dissimilar response pairs obtained a better margin and win rate on the dialogue task. Our findings suggest that focusing on distinct pairs can reduce the label error to improve the efficiency of LLM alignment, saving up to 65% of annotators' work.

Problem

Research questions and friction points this paper is trying to address.

Aligning LLMs to human preferences efficiently

Reducing human bias in response pair annotation

Improving annotation quality for LLM alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses embedding similarity for response selection

Focuses on dissimilar pairs to reduce bias

Saves annotators' work by 65%

🔎 Similar Papers

No similar papers found.