DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization

📅 2026-05-11
🤖 AI Summary
Existing preference optimization methods for large language models struggle to enforce directional consistency while preserving reasoning diversity. This work proposes Directional-Groupwise Preference Optimization (DGPO), a lightweight framework that organizes forward and reverse question-answer instances into structured sets to provide group-level supervision. DGPO models direction-aware alignment through multi-candidate comparisons and optimizes a margin-based likelihood objective that separates coherent reasoning paths from inconsistent ones. The constructed reverse data alone yields a 3.2% average improvement across five benchmarks, and DGPO delivers further gains across multiple datasets and model families, with average accuracy improvements of up to 3.6%.
📝 Abstract
Although Large Language Models (LLMs) have made remarkable progress, current preference optimization methods still struggle to align directional consistency while preserving reasoning diversity. To address this limitation, we propose Directional-Groupwise Preference Optimization (DGPO), a lightweight framework that aggregates supervision signals at the group level and explicitly models direction-aware alignment through multi-candidate comparisons. DGPO organizes forward and reverse question-answer instances into structured sets and optimizes a margin-based likelihood objective that separates coherent reasoning paths from inconsistent alternatives. This group-wise formulation captures richer relative information than pairwise objectives and reinforces consistency across diverse reasoning pathways. Empirical results show that our constructed reverse data yields a 3.2% average improvement across five benchmarks, while DGPO further delivers consistent gains across multiple datasets and model families, achieving average accuracy improvements of up to 3.6%.
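The abstract does not spell out the objective, so the following is only a minimal sketch of what a group-wise, margin-based likelihood loss could look like. It assumes a hinge-style penalty averaged over every (consistent, inconsistent) candidate pair within one group; the function name `groupwise_margin_loss`, its parameters, and the hinge form itself are illustrative assumptions, not the paper's actual formulation.

```python
from typing import Sequence


def groupwise_margin_loss(
    consistent_logps: Sequence[float],
    inconsistent_logps: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Hypothetical group-wise margin loss (an assumption, not the paper's loss).

    Each input is a list of sequence log-likelihoods for the candidates in one
    group. The loss is the mean hinge penalty over all cross pairs, so it is
    zero only when every consistent candidate beats every inconsistent one by
    at least `margin`.
    """
    if not consistent_logps or not inconsistent_logps:
        raise ValueError("each group needs at least one candidate of each kind")

    total = 0.0
    for lp_pos in consistent_logps:
        for lp_neg in inconsistent_logps:
            # Penalize pairs whose log-likelihood gap falls short of the margin.
            total += max(0.0, margin - (lp_pos - lp_neg))
    return total / (len(consistent_logps) * len(inconsistent_logps))


# Example group: two consistent candidates, one inconsistent candidate.
loss = groupwise_margin_loss([-2.0, -1.0], [-1.5], margin=1.0)
print(loss)
```

Comparing every consistent candidate against every inconsistent one is what distinguishes this sketch from a pairwise objective, which would see only a single chosen/rejected pair per example.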
Problem

Research questions and friction points this paper is trying to address.

directional consistency
preference optimization
reasoning diversity
Large Language Models
alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Directional Consistency
Groupwise Optimization
Preference Alignment
Multi-candidate Comparison
Margin-based Likelihood
Mengyi Deng
Information Hub, The Hong Kong University of Science and Technology (Guangzhou), China
Zhiwei Li
The Hong Kong University of Science and Technology (Guangzhou)
Large Language Models
Xin Li
Information Hub, The Hong Kong University of Science and Technology (Guangzhou), China
Tingyu Zhu
Information Hub, The Hong Kong University of Science and Technology (Guangzhou), China
Yulan Yuan
Information Hub, The Hong Kong University of Science and Technology (Guangzhou), China
Zhijiang Guo
HKUST (GZ) | HKUST
Natural Language Processing, Machine Learning, Large Language Models
Wei Wang
The Hong Kong University of Science and Technology
Cloud Computing, Machine Learning Systems, Big Data Systems, Computer Networking