🤖 AI Summary
This work systematically identifies, for the first time, supply-chain-level security risks in RLHF platforms during reward modeling and alignment fine-tuning, introducing the novel threat model of *platform-induced model misalignment*. Targeting the vulnerability of preference datasets to tampering, we propose a goal-directed preference data poisoning attack: by selectively corrupting preference samples, we adversarially manipulate reward model training to induce latent, persistent alignment failures—specifically in safety- and value-sensitive tasks. We validate the attack across mainstream open-source RLHF frameworks (TRL, Axolotl), demonstrating >62% degradation in alignment fidelity for Llama-3 and Qwen models, while evading existing detection mechanisms. Our study establishes a new paradigm for RLHF platform security evaluation, providing critical insights and foundational defensive baselines for trustworthy alignment.
📝 Abstract
Reinforcement learning has shown remarkable performance in aligning language models with human preferences, leading to the rise of attention towards developing RLHF platforms. These platforms enable users to fine-tune models without requiring any expertise in developing complex machine learning algorithms. While these platforms offer useful features such as reward modeling and RLHF fine-tuning, their security and reliability remain largely unexplored. Given the growing adoption of RLHF and open-source RLHF frameworks, we investigate the trustworthiness of these systems and their potential impact on behavior of LLMs. In this paper, we present an attack targeting publicly available RLHF tools. In our proposed attack, an adversarial RLHF platform corrupts the LLM alignment process by selectively manipulating data samples in the preference dataset. In this scenario, when a user's task aligns with the attacker's objective, the platform manipulates a subset of the preference dataset that contains samples related to the attacker's target. This manipulation results in a corrupted reward model, which ultimately leads to the misalignment of the language model. Our results demonstrate that such an attack can effectively steer LLMs toward undesirable behaviors within the targeted domains. Our work highlights the critical need to explore the vulnerabilities of RLHF platforms and their potential to cause misalignment in LLMs during the RLHF fine-tuning process.