🤖 AI Summary
This work identifies a novel clean-label poisoning attack on reward models in text-to-image (T2I) reinforcement learning: an adversary constructs semantically inconsistent yet perceptually plausible image-text preference pairs, using only naturally occurring examples and without tampering with the annotation procedure, to induce feature-space collisions that covertly distort the reward signal. Methodologically, it introduces the first cross-modal (vision-language) clean-label reward-model poisoning framework, combining feature-space adversarial perturbations, multimodal embedding alignment modeling, and gradient sensitivity analysis. Evaluated on mainstream T2I systems including Stable Diffusion, the attack achieves a success rate above 82% in steering generation toward biased or violent images, while the poisoned samples remain imperceptible to human annotators. This work is the first to systematically expose and realize a highly stealthy, practically effective, annotation-process-free cross-modal reward-model poisoning threat, providing critical security insights and a technical benchmark for RLHF safety research.
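The "feature-space collision" mechanism named above follows the general clean-label poisoning recipe: perturb a benign-looking image, within an imperceptibility budget, until the reward model's encoder maps it onto the embedding of an attacker-chosen target image. The sketch below is a rough illustration of that recipe, not the paper's actual implementation; `vision_encoder`, `craft_poison`, and the L-infinity budget are illustrative assumptions.

```python
# Hypothetical feature-collision poisoning step (PyTorch sketch).
# `vision_encoder` stands in for the reward model's image encoder;
# BadReward's exact objective and constraints are not reproduced here.
import torch

def craft_poison(vision_encoder, base_img, target_img,
                 eps=8 / 255, lr=0.01, steps=500):
    """Perturb `base_img` (a benign-looking preferred image) so its
    embedding collides with that of `target_img` (the image the
    attacker wants rewarded), under an L-inf budget `eps` that keeps
    the poison visually indistinguishable to human annotators."""
    vision_encoder.eval()
    with torch.no_grad():
        target_feat = vision_encoder(target_img)  # collision target

    delta = torch.zeros_like(base_img, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        poison = (base_img + delta).clamp(0.0, 1.0)
        feat = vision_encoder(poison)
        # Pull the poison's embedding toward the target's embedding.
        loss = torch.nn.functional.mse_loss(feat, target_feat)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Project back into the imperceptibility budget.
        with torch.no_grad():
            delta.clamp_(-eps, eps)
    return (base_img + delta).detach().clamp(0.0, 1.0)
```

Pairing such a poison with a benign caption as the "preferred" item would teach the reward model to score the target's features highly for that caption, with no label flipping and no interference with annotators.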
📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning text-to-image (T2I) models with human preferences. However, RLHF's feedback mechanism also opens new pathways for adversaries. This paper demonstrates the feasibility of hijacking T2I models by poisoning a small fraction of preference data with natural-appearing examples. Specifically, we propose BadReward, a stealthy clean-label poisoning attack targeting the reward model in multi-modal RLHF. BadReward operates by inducing feature collisions between visually contradictory preference data instances, thereby corrupting the reward model and indirectly compromising the T2I model's integrity. Unlike existing alignment-poisoning techniques that focus on a single (text) modality, BadReward is independent of the preference annotation process, enhancing its stealth and practical threat. Extensive experiments on popular T2I models show that BadReward can consistently guide generation toward improper outputs, such as biased or violent imagery, for targeted concepts. Our findings underscore the amplified threat landscape for RLHF in multi-modal systems and highlight the urgent need for robust defenses.

Disclaimer. This paper contains uncensored toxic content that might be offensive or disturbing to readers.
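For context on why a feature collision corrupts the reward model: reward models for RLHF are typically trained with a pairwise preference loss of the form below (a generic Bradley-Terry-style sketch, assumed here; `reward_model` and its signature are illustrative, not the paper's code). Because the loss sees only the model's internal features, a clean-label poison that collides in feature space with an attacker-chosen image is indistinguishable, at training time, from a genuine preference for that image.

```python
# Illustrative pairwise reward-model loss (standard Bradley-Terry
# style objective, assumed here; the paper's training setup may differ).
import torch
import torch.nn.functional as F

def preference_loss(reward_model, prompt, img_preferred, img_rejected):
    """Negative log-likelihood that the preferred image outranks the
    rejected one: -log sigmoid(r_pref - r_rej)."""
    r_pref = reward_model(prompt, img_preferred)
    r_rej = reward_model(prompt, img_rejected)
    return -F.logsigmoid(r_pref - r_rej).mean()
```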