MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization

📅 2024-10-10
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Addressing the “weak-to-strong alignment” challenge—where weak supervisors struggle to align stronger LLM students—the paper proposes a multi-agent collaborative contrastive preference optimization framework. Methodologically, it introduces (1) a bidirectional co-evolution mechanism between weak teachers and strong students, enabling mutual positive-behavior reinforcement and adaptive hard-negative sample construction; and (2) iterative behavioral distillation, departing from conventional strong-to-weak and self-alignment paradigms. Evaluated on the HH-RLHF and PKU-SafeRLHF benchmarks, the approach significantly improves alignment performance for both teachers and students, with consistent gains confirmed by automatic metrics and human evaluation. Notably, the method scales with supervision: alignment quality improves steadily as the number of weak teachers increases, indicating robust extensibility.

📝 Abstract
As large language models (LLMs) rapidly advance and achieve near-human capabilities on specific tasks, aligning them with human values becomes increasingly urgent. In scenarios where LLMs outperform humans, we face a weak-to-strong alignment problem: strong student LLMs must be aligned effectively using weak supervision generated by weak teachers. Existing alignment methods mainly target strong-to-weak alignment and self-alignment settings, and adapting them to the much harder weak-to-strong setting is impractical. To fill this gap, we propose a multi-agent contrastive preference optimization (MACPO) framework. MACPO enables weak teachers and strong students to learn from each other by iteratively reinforcing unfamiliar positive behaviors while penalizing familiar negative ones. To this end, we devise a mutual positive behavior augmentation strategy that encourages weak teachers and strong students to learn from each other's positive behavior and, in turn, to provide higher-quality positive behavior for the next iteration. Additionally, we propose a hard negative behavior construction strategy that induces weak teachers and strong students to generate familiar negative behavior by fine-tuning on negative behavioral data. Experimental results on the HH-RLHF and PKU-SafeRLHF datasets, evaluated with both automatic metrics and human judgments, demonstrate that MACPO simultaneously improves the alignment performance of strong students and weak teachers. Moreover, as the number of weak teachers increases, MACPO achieves better weak-to-strong alignment performance through additional optimization rounds.
Problem

Research questions and friction points this paper is trying to address.

Align strong LLMs with weak supervision from weak teachers.
Improve alignment performance using multi-agent contrastive preference optimization.
Enhance weak-to-strong alignment through iterative behavior reinforcement.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent contrastive preference optimization framework
Mutual positive behavior augmentation strategy
Hard negative behavior construction strategy
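The core contrastive objective described above (reinforce unfamiliar positive behavior, penalize familiar negative behavior) can be sketched as a DPO-style pairwise loss. This is an illustrative sketch only, not the paper's exact formulation: the function name, the single-pair interface, and the `beta` temperature are assumptions; the real framework operates over batches of multi-agent responses with reference models on both sides.

```python
import math

def contrastive_preference_loss(logp_pos, logp_neg,
                                ref_logp_pos, ref_logp_neg,
                                beta=0.1):
    """DPO-style contrastive preference loss for one (positive, negative) pair.

    logp_pos / logp_neg: policy log-probabilities of the positive
    (unfamiliar, augmented) and negative (familiar, hard) behaviors.
    ref_logp_*: the same quantities under a frozen reference model.
    """
    # How much the policy upweights each behavior relative to the reference.
    pos_shift = logp_pos - ref_logp_pos
    neg_shift = logp_neg - ref_logp_neg
    # Margin between the positive and negative behavior; beta scales it.
    margin = beta * (pos_shift - neg_shift)
    # -log(sigmoid(margin)): small when the positive behavior is favored,
    # large when the familiar negative behavior still dominates.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With zero shifts the loss starts at log 2; it decreases as the policy separates positive from negative behavior, which is the gradient signal that drives each iteration of mutual augmentation and hard-negative construction.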