Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing research lacks a systematic evaluation of trade-offs among five core alignment objectives—factual consistency, safety, conciseness, proactiveness, and diversity—across large language model (LLM) alignment methods. This paper introduces the first unified, multidimensional evaluation framework, leveraging human-validated LLM-as-Judge prompts to benchmark prominent alignment algorithms—including PPO, DPO, ORPO, and KTO—on both in-distribution and out-of-distribution data. Key findings reveal that DPO and KTO achieve superior factual consistency; PPO and DPO excel in safety; and PPO strikes the best balance between conciseness and proactiveness. Furthermore, we identify an intrinsic tension between generalization and diversity. These empirically grounded insights provide actionable guidance for algorithm selection and the development of reliable, aligned LLMs.

📝 Abstract
Large language models (LLMs) require careful alignment to balance competing objectives: factuality, safety, conciseness, proactivity, and diversity. Existing studies focus on individual techniques or specific dimensions, lacking a holistic assessment of the inherent trade-offs. We propose a unified evaluation framework that compares LLM alignment methods (PPO, DPO, ORPO, KTO) across these five axes, using both in-distribution and out-of-distribution datasets. Leveraging a specialized LLM-as-Judge prompt, validated through human studies, we reveal that DPO and KTO excel in factual accuracy, PPO and DPO lead in safety, and PPO best balances conciseness with proactivity. Our findings provide insights into the trade-offs of common alignment methods, guiding the development of more balanced and reliable LLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluating trade-offs in LLM alignment methods
Assessing alignment across diversity, generalization, safety
Comparing PPO, DPO, ORPO, KTO performance holistically
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified evaluation framework for alignment methods
LLM-as-Judge prompt validated through human studies
Compares PPO, DPO, ORPO, and KTO across five axes
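The evaluation loop described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the five axis names come from the paper, but the prompt template, the `call_judge` interface, the 1-5 scoring scale, and all function names are assumptions made here for clarity.

```python
# Hypothetical sketch of a multi-axis LLM-as-Judge evaluator.
# The five axes are taken from the paper; everything else
# (prompt wording, judge interface, scale) is illustrative.

AXES = ["factual consistency", "safety", "conciseness",
        "proactiveness", "diversity"]

JUDGE_PROMPT = (
    "Rate the response below on a 1-5 scale for {axis}.\n"
    "Prompt: {prompt}\nResponse: {response}\nScore:"
)

def judge_response(prompt, response, call_judge):
    """Score one (prompt, response) pair on every axis.

    `call_judge` is any callable mapping a judge prompt to an
    integer score in [1, 5], e.g. a wrapper around an LLM API.
    """
    return {
        axis: call_judge(
            JUDGE_PROMPT.format(axis=axis, prompt=prompt, response=response)
        )
        for axis in AXES
    }

def compare_methods(results, call_judge):
    """Average per-axis scores for each alignment method.

    `results` maps a method name (e.g. "DPO") to a list of
    (prompt, response) pairs produced by a model aligned
    with that method.
    """
    summary = {}
    for method, pairs in results.items():
        totals = {axis: 0.0 for axis in AXES}
        for prompt, response in pairs:
            scores = judge_response(prompt, response, call_judge)
            for axis, score in scores.items():
                totals[axis] += score
        summary[method] = {axis: totals[axis] / len(pairs) for axis in AXES}
    return summary
```

In practice `call_judge` would query the validated judge model; here it is left abstract so the aggregation logic can be tested with a stub.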
Denis Janiak
PhD student, Wrocław University of Science and Technology
bayesian deep learning · probabilistic machine learning · representation learning
Julia Moska
Wroclaw University of Science and Technology (WUST)
Dawid Motyka
Wroclaw University of Science and Technology (WUST)
Karolina Seweryn
NASK - National Research Institute, Warsaw University of Technology
Paweł Walkowiak
Wroclaw University of Science and Technology (WUST)
Bartosz Żuk
Institute of Computer Science, Polish Academy of Sciences (IPI PAN)
Arkadiusz Janz
Wrocław University of Science and Technology
machine learning · natural language processing · computational linguistics