Multi-Level Aware Preference Learning: Enhancing RLHF for Complex Multi-Instruction Tasks

2025-05-19
Citations: 0
Influential: 0
AI Summary
Existing RLHF methods exhibit insufficient preference alignment in multi-instruction tasks: they over-rely on human annotations or powerful LLMs and model only intra-sample response comparisons, neglecting implicit prompt semantics and cross-sample preference correlations among instruction combinations. Method: We propose the Multi-level Aware Preference Learning (MAPL) framework, the first to jointly model implicit prompt semantics and cross-sample multi-instruction preference dependencies, breaking the conventional single-layer response-comparison paradigm. MAPL constructs intra-sample datasets via synthetically generated multi-condition prompts and inter-sample datasets via instruction-combination sampling, both tailored for reward modeling and DPO training. Contribution/Results: MAPL significantly improves instruction-following accuracy and preference consistency across multiple multi-instruction benchmarks, without requiring additional human annotations or stronger LLMs. It achieves a superior balance among performance, training efficiency, and fairness.

Abstract
RLHF has emerged as a predominant approach for aligning artificial intelligence systems with human preferences, demonstrating exceptional and measurable efficacy in instruction following tasks; however, it exhibits insufficient compliance capabilities when confronted with complex multi-instruction tasks. Conventional approaches rely heavily on human annotation or more sophisticated large language models, thereby introducing substantial resource expenditure or potential bias concerns. Meanwhile, alternative synthetic methods that augment standard preference datasets often compromise the model's semantic quality. Our research identifies a critical oversight in existing techniques, which predominantly focus on comparing responses while neglecting valuable latent signals embedded within prompt inputs, and which only focus on preference disparities at the intra-sample level, while neglecting to account for the inter-sample level preference differentials that exist among preference data. To leverage these previously neglected indicators, we propose a novel Multi-level Aware Preference Learning (MAPL) framework, capable of enhancing multi-instruction capabilities. Specifically, for any given response in original preference data pairs, we construct varied prompts with a preference relation under different conditions, in order to learn intra-sample level preference disparities. Furthermore, for any given original preference pair, we synthesize multi-instruction preference pairs to capture preference discrepancies at the inter-sample level. Building on the two datasets constructed above, we consequently devise two sophisticated training objective functions. Subsequently, our framework integrates seamlessly into both Reward Modeling and Direct Preference Optimization paradigms. Through rigorous evaluation across multiple benchmarks, we empirically validate the efficacy of our framework.
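The abstract names Reward Modeling and Direct Preference Optimization as the two paradigms the framework plugs into, but it does not state MAPL's own objective functions. As a point of reference only, the following is a minimal sketch of the standard pairwise losses used in those two paradigms (the Bradley-Terry reward-modeling loss and the DPO loss). The function and parameter names are illustrative assumptions, and nothing here should be read as the paper's objectives.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # For z >= 0 use -log(1 + exp(-z)); for z < 0 use z - log(1 + exp(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bradley_terry_loss(reward_preferred: float, reward_dispreferred: float) -> float:
    """Standard pairwise reward-modeling loss: -log sigmoid(r_w - r_l)."""
    return -log_sigmoid(reward_preferred - reward_dispreferred)


def dpo_loss(
    policy_logp_w: float,
    ref_logp_w: float,
    policy_logp_l: float,
    ref_logp_l: float,
    beta: float = 0.1,  # illustrative default, not a value from the paper
) -> float:
    """Standard DPO loss on one pair of sequence log-probabilities."""
    # Implicit reward margin: how much more the policy favors the preferred
    # item than the reference model does.
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)
```

In the usual response-level setting, the two scored items are `(prompt, chosen)` and `(prompt, rejected)`. Under one possible reading of the abstract, a prompt-level comparison would instead score `(prompt_a, response)` against `(prompt_b, response)` with the response held fixed; the same pairwise form applies, though the paper may define it differently.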
Problem

Research questions and friction points this paper is trying to address.

RLHF lacks compliance in complex multi-instruction tasks
Existing methods neglect latent signals in prompt inputs
Current approaches ignore inter-sample level preference differentials
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-level Aware Preference Learning framework
Constructs intra-sample level preference disparities
Synthesizes inter-sample level preference discrepancies
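The two data-construction ideas listed above can be illustrated with a toy sketch. The summary does not specify how prompts or instruction combinations are generated, so everything below is an assumption made for illustration: the `CONSTRAINTS` table, the rule-based checkers, and the "satisfies more instructions" labeling rule are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# Hypothetical instruction texts with rule-based checkers (assumption).
CONSTRAINTS: Dict[str, Callable[[str], bool]] = {
    "Answer in at most 10 words.": lambda r: len(r.split()) <= 10,
    "Use only lowercase letters.": lambda r: r == r.lower(),
    "End with a period.": lambda r: r.rstrip().endswith("."),
}


@dataclass
class PromptPair:
    """One fixed response with a better-matching and a worse-matching prompt."""
    response: str
    preferred_prompt: str
    dispreferred_prompt: str


def intra_sample_pairs(base_prompt: str, response: str) -> List[PromptPair]:
    """Pair each instruction the response satisfies with each one it violates."""
    met = [text for text, check in CONSTRAINTS.items() if check(response)]
    unmet = [text for text, check in CONSTRAINTS.items() if not check(response)]
    return [
        PromptPair(response, f"{base_prompt} {m}", f"{base_prompt} {u}")
        for m in met
        for u in unmet
    ]


def satisfied_count(response: str, instructions: Tuple[str, ...]) -> int:
    """Number of the given instructions that the response satisfies."""
    return sum(CONSTRAINTS[text](response) for text in instructions)


def multi_instruction_pairs(
    base_prompt: str, response_a: str, response_b: str, size: int = 2
) -> List[Tuple[str, str, str]]:
    """Build (prompt, preferred, dispreferred) triples over instruction combinations."""
    triples = []
    for combo in combinations(CONSTRAINTS, size):
        a = satisfied_count(response_a, combo)
        b = satisfied_count(response_b, combo)
        if a == b:
            continue  # no preference signal for this combination
        prompt = " ".join((base_prompt, *combo))
        # Assumed labeling rule: prefer the response meeting more instructions.
        if a > b:
            triples.append((prompt, response_a, response_b))
        else:
            triples.append((prompt, response_b, response_a))
    return triples
```

For example, `intra_sample_pairs("Name the capital of France.", "Paris is the capital of France")` pairs the word-limit instruction (which the response meets) against the lowercase and trailing-period instructions (which it does not). A real pipeline would likely need richer instructions and a more robust compliance check than these string rules.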
Authors
Ruopei Sun
University of Science and Technology of China
Jianfeng Cai
University of Science and Technology of China
Jinhua Zhu
University of Science and Technology of China
Machine Learning
Kangwen Zhao
University of Science and Technology of China
Dongyun Xue
University of Science and Technology of China
Wengang Zhou
Professor, EEIS Department, University of Science and Technology of China
Multimedia Retrieval, Computer Vision, Computer Game
Li Li
University of Science and Technology of China
Houqiang Li
Professor, Department of Electric Engineering and Information Science, University of Science and
Multimedia Search, Image/Video Analysis, Image/Video Coding