HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing preference optimization methods struggle to give fine-grained feedback on multi-step solutions in complex reasoning tasks, and they tend to offer either training stability or structured reasoning, not both. To address this, this work proposes Hierarchical Preference Optimization (HiPO), a framework that adds hierarchical structure to Direct Preference Optimization (DPO). HiPO splits each model response into segments (query clarification and context, intermediate reasoning steps, and the final answer), computes a DPO loss for each segment separately, and combines the per-segment losses as a weighted sum. This allows segment-level alignment with human preferences while retaining DPO's computational efficiency and training stability. In experiments on multiple 7B-scale models fine-tuned on the Math Stack Exchange preference dataset, models trained with HiPO outperform those trained with DPO on a variety of common math benchmarks and receive higher GPT-4.1 ratings for organization, logical flow, and consistency.

📝 Abstract

Direct Preference Optimization (DPO) is an effective framework for aligning large language models with human preferences, but it struggles with complex reasoning tasks. DPO optimizes for the likelihood of generating preferred over dispreferred responses in their entirety and lacks the granularity to provide feedback on subsections of many-step solutions typical of reasoning tasks. Existing methods excel at either stable preference learning (e.g., DPO variants like KTO and RSO) or structured reasoning (e.g., ReMA's multi-agent RL framework, Tree of Thoughts), but fail to merge these complementary strengths. We propose HiPO (Hierarchical Preference Optimization), an extension of DPO that separates responses into reasoning segments (query clarification and context, reasoning steps, and answer) and computes loss as a weighted sum of the DPO loss for each segment. Our approach enables segment-specific training while maintaining DPO's computational efficiency and training stability. We demonstrate that for multiple 7B LLMs fine-tuned using HiPO and DPO on the Math Stack Exchange preference dataset, the models trained with HiPO outperform the others on a variety of common math benchmarks and achieve greater organization, logical flow, and consistency as measured by GPT-4.1.
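The weighted-sum loss described in the abstract can be illustrated with a minimal sketch. It assumes the standard DPO loss, -log sigmoid(beta * margin), applied to summed token log-probabilities of each segment. The function names (`dpo_loss`, `hipo_loss`), the segment keys, the weight values, and `beta=0.1` are all hypothetical choices for illustration and are not taken from the paper, which may define segmentation and weighting differently.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a text span under
    the policy or the frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def hipo_loss(segment_logps: dict, weights: dict, beta: float = 0.1) -> float:
    """Weighted sum of per-segment DPO losses (illustrative only).

    segment_logps maps a segment name to a dict with the keys
    'policy_chosen', 'policy_rejected', 'ref_chosen', 'ref_rejected'.
    weights maps the same segment names to non-negative floats.
    """
    total = 0.0
    for name, weight in weights.items():
        lp = segment_logps[name]
        total += weight * dpo_loss(
            lp["policy_chosen"], lp["policy_rejected"],
            lp["ref_chosen"], lp["ref_rejected"], beta,
        )
    return total


# Hypothetical segment names and weights, assumed for this example.
example_weights = {"clarification": 0.2, "reasoning": 0.5, "answer": 0.3}
example_logps = {
    "clarification": {"policy_chosen": -12.0, "policy_rejected": -13.0,
                      "ref_chosen": -12.5, "ref_rejected": -12.5},
    "reasoning": {"policy_chosen": -40.0, "policy_rejected": -48.0,
                  "ref_chosen": -42.0, "ref_rejected": -44.0},
    "answer": {"policy_chosen": -3.0, "policy_rejected": -6.0,
               "ref_chosen": -4.0, "ref_rejected": -4.5},
}
example_loss = hipo_loss(example_logps, example_weights)
```

Because each segment contributes its own term, raising the weight on the reasoning segment makes errors in intermediate steps count for more than they would under a single whole-response DPO loss. Setting one weight to 1 and the others to 0 reduces the sketch to ordinary DPO on that segment alone.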

Problem

Research questions and friction points this paper is trying to address.

Direct Preference Optimization

complex reasoning

preference alignment

fine-grained feedback

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Preference Optimization

Direct Preference Optimization

reasoning segmentation