Adversarial Preference Learning for Robust LLM Alignment

📅 2025-05-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the insufficient robustness of Reinforcement Learning from Human Feedback (RLHF) against adversarial attacks, which stems from inefficient human annotation, the high diversity of possible attacks, and the risks of feedback bias and reward hacking, this paper proposes an iterative adversarial preference learning framework. Its key contributions are: (1) the first direct harmfulness metric grounded in the model's intrinsic preference probabilities; (2) an input-aware conditional generative attacker; and (3) a fully automated, human-free closed-loop alignment mechanism. Evaluated on Mistral-7B, the method reduces the harmful output rate from 5.88% to 0.43%, achieves an 83.33% harmlessness win rate over the base model under GPT-4o evaluation, lowers the HarmBench attack success rate by up to 65%, and maintains strong utility with an MT-Bench score of 6.59, demonstrating simultaneous improvements in safety and practical performance.
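The first contribution scores harmfulness from the model's own preference probabilities rather than an external judge. Below is a minimal sketch of how such an intrinsic score could be computed with a Hugging Face-style causal LM, assuming a Bradley-Terry preference derived from the policy's sequence log-probabilities; the paper's exact formulation may differ, and all function names are illustrative:

```python
# Sketch only: an "intrinsic preference" harmfulness score, assuming a
# Bradley-Terry probability over the policy's own sequence log-probs.
# Names (sequence_logprob, harmfulness_score) are illustrative, not the
# authors' API; the paper's exact metric may differ.
import torch
import torch.nn.functional as F

@torch.no_grad()
def sequence_logprob(model, tokenizer, prompt: str, response: str) -> float:
    """Total log-probability the model assigns to `response` given `prompt`.
    Simplification: assumes tokenizing prompt+response splits cleanly at the
    prompt boundary."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1, :]          # logits predicting ids[:, 1:]
    logps = F.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logps[:, prompt_len - 1:].sum().item()  # response tokens only

@torch.no_grad()
def harmfulness_score(model, tokenizer, prompt: str,
                      harmful: str, safe: str, beta: float = 1.0) -> float:
    """P(model 'prefers' the harmful continuation over the safe one);
    values near 1 flag prompts where the policy is most exploitable."""
    delta = (sequence_logprob(model, tokenizer, prompt, harmful)
             - sequence_logprob(model, tokenizer, prompt, safe))
    return torch.sigmoid(torch.tensor(beta * delta)).item()
```

Because the score is read off the policy itself, the attacker can query it at no annotation cost, which is what removes humans from the loop.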

📝 Abstract
Modern language models often rely on Reinforcement Learning from Human Feedback (RLHF) to encourage safe behaviors. However, they remain vulnerable to adversarial attacks due to three key limitations: (1) the inefficiency and high cost of human annotation, (2) the vast diversity of potential adversarial attacks, and (3) the risk of feedback bias and reward hacking. To address these challenges, we introduce Adversarial Preference Learning (APL), an iterative adversarial training method incorporating three key innovations. First, a direct harmfulness metric based on the model's intrinsic preference probabilities, eliminating reliance on external assessment. Second, a conditional generative attacker that synthesizes input-specific adversarial variations. Third, an iterative framework with automated closed-loop feedback, enabling continuous adaptation through vulnerability discovery and mitigation. Experiments on Mistral-7B-Instruct-v0.3 demonstrate that APL significantly enhances robustness, achieving 83.33% harmlessness win rate over the base model (evaluated by GPT-4o), reducing harmful outputs from 5.88% to 0.43% (measured by LLaMA-Guard), and lowering attack success rate by up to 65% according to HarmBench. Notably, APL maintains competitive utility, with an MT-Bench score of 6.59 (comparable to the baseline 6.78) and an LC-WinRate of 46.52% against the base model.
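As a concrete reference for the preference-learning update implied by "adversarial preference learning," here is a standard Direct Preference Optimization (DPO) loss over (chosen, rejected) response log-probabilities. Whether APL uses exactly this objective is an assumption made for illustration, and the usage numbers are made up:

```python
# A standard DPO loss (Rafailov et al., 2023) as a stand-in for the
# preference-learning update; whether APL uses this exact objective is
# an assumption made for illustration.
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp: torch.Tensor, pi_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * margin), where the margin is how much more the
    policy (relative to the reference model) favors chosen over rejected."""
    logits = beta * ((pi_chosen_logp - ref_chosen_logp)
                     - (pi_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up sequence log-probabilities:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # below ln(2): policy already leans toward the chosen response
```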
Problem

Research questions and friction points this paper is trying to address.

Addresses inefficiency and high cost of human annotation in RLHF
Mitigates diverse adversarial attacks on language models
Reduces feedback bias and reward hacking risks in alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct harmfulness metric using intrinsic preference probabilities
Conditional generative attacker for input-specific variations
Iterative framework with automated closed-loop feedback (see the sketch after this list)
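Read together, the three innovations form a single training loop: the attacker perturbs each prompt, the intrinsic metric ranks the defender's responses, and the preference update closes the loop. Here is a structural sketch under assumed interfaces; none of these callables are the authors' actual API:

```python
# Structural sketch of one APL iteration; interfaces are assumptions
# made for illustration, not the authors' implementation.
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (adversarial prompt, chosen, rejected)

def apl_iteration(
    seed_prompts: List[str],
    attack: Callable[[str], str],          # conditional generative attacker
    respond: Callable[[str], str],         # current defender policy (stochastic)
    harm: Callable[[str, str], float],     # intrinsic harmfulness metric
    align: Callable[[List[Pair]], None],   # preference update (e.g., DPO)
) -> None:
    pairs: List[Pair] = []
    for x in seed_prompts:
        x_adv = attack(x)                            # input-specific variant
        y1, y2 = respond(x_adv), respond(x_adv)      # two sampled responses
        # The safer response (lower harmfulness) becomes "chosen".
        chosen, rejected = (y1, y2) if harm(x_adv, y1) <= harm(x_adv, y2) else (y2, y1)
        pairs.append((x_adv, chosen, rejected))
    align(pairs)   # close the loop: train on the failures just discovered
```

Repeating this round, with the attacker re-targeted at each new defender, is what lets the framework keep discovering and patching vulnerabilities without human annotation.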
Authors

Yuanfu Wang
Shanghai Artificial Intelligence Laboratory, China
Pengyu Wang
Software College, Northeastern University, China
Chenyang Xi
Beijing Institute of Technology, China
Bo Tang
University of Science and Technology of China, Suzhou Institute for Advanced Research, Suzhou, China
Junyi Zhu
MemTensor (Shanghai) Technology Co., Ltd, China
Wenqiang Wei
MemTensor (Shanghai) Technology Co., Ltd, China
Chen Chen
MemTensor (Shanghai) Technology Co., Ltd, China
Chao Yang
Shanghai Artificial Intelligence Laboratory, China
Jingfeng Zhang
University of Auckland, New Zealand
Chaochao Lu
Shanghai Artificial Intelligence Laboratory, China
Yijun Niu
MemTensor (Shanghai) Technology Co., Ltd, China
Keming Mao
Software College, Northeastern University, China
Zhiyu Li
Tianjin University, China
Feiyu Xiong
MemTensor (Shanghai) Technology Co., Ltd, China
Jie Hu
China Telecom Corporation Limited Beijing Research Institute, China
Mingchuan Yang
China Telecom Corporation Limited Beijing Research Institute, China