GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of existing safety alignment methods, which are often circumvented post-deployment, and the limitations of prevailing unalignment techniques, which rely heavily on labeled data and degrade model utility. The authors propose GRP-Obliteration, a novel approach that, for the first time, effectively removes safety constraints from both large language models and diffusion-based image generators using only a single unlabeled prompt. Built on Group Relative Policy Optimization (GRPO), the method operates without any supervisory signal, removing the dependence on annotated data and on architecture-specific assumptions. Evaluated across fifteen models ranging from 7B to 20B parameters and spanning diverse mainstream architectures, GRP-Obliteration achieves superior unalignment performance on five established safety benchmarks while preserving original capabilities on six general-purpose evaluation suites.

📝 Abstract
Safety alignment is only as robust as its weakest failure mode. Despite extensive work on safety post-training, it has been shown that models can be readily unaligned through post-deployment fine-tuning. However, existing unalignment methods often require extensive data curation and degrade model utility. In this work, we extend the practical limits of unalignment by introducing GRP-Obliteration (GRP-Oblit), a method that uses Group Relative Policy Optimization (GRPO) to directly remove safety constraints from target models. We show that a single unlabeled prompt is sufficient to reliably unalign safety-aligned models while largely preserving their utility, and that GRP-Oblit achieves stronger unalignment on average than existing state-of-the-art techniques. Moreover, GRP-Oblit generalizes beyond language models and can also unalign diffusion-based image generation systems. We evaluate GRP-Oblit on six utility benchmarks and five safety benchmarks across fifteen 7–20B-parameter models, spanning instruct and reasoning models, as well as dense and MoE architectures. The evaluated model families include GPT-OSS, distilled DeepSeek, Gemma, Llama, Ministral, and Qwen.
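
Since the abstract centers on GRPO driven by a single unlabeled prompt, a toy sketch of the group-relative update may help make the mechanism concrete. Everything below is illustrative, not the authors' implementation: the policy is a single logit table rather than an LLM, `placeholder_reward` stands in for whatever unsupervised signal the method actually uses (the page does not specify it), and PPO-style ratio clipping is omitted for brevity. Only the within-group advantage normalization and the KL penalty to a frozen reference reflect the GRPO structure the paper names.

```python
# Minimal GRPO-style sketch in PyTorch. Assumptions: toy categorical policy,
# a made-up reward, no ratio clipping. Illustrates group-relative advantages
# only; it is not GRP-Oblit itself.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

VOCAB, SEQ_LEN, GROUP = 32, 8, 16  # toy sizes; real models are far larger

# Toy "policy": one logit table shared across all generation steps.
policy_logits = torch.zeros(VOCAB, requires_grad=True)
ref_logits = policy_logits.detach().clone()  # frozen reference for the KL term
opt = torch.optim.Adam([policy_logits], lr=1e-2)

def sample_group():
    """Sample GROUP completions of SEQ_LEN tokens from the current policy.
    In the paper's setting, all of these would come from the same single prompt."""
    probs = F.softmax(policy_logits, dim=-1)
    flat = torch.multinomial(probs.repeat(GROUP * SEQ_LEN, 1), num_samples=1)
    return flat.view(GROUP, SEQ_LEN)

def placeholder_reward(completions):
    """Stand-in scalar reward per completion (assumption, not from the paper)."""
    return completions.float().mean(dim=-1)  # e.g., favor high token ids

for step in range(200):
    with torch.no_grad():
        group = sample_group()                  # (GROUP, SEQ_LEN) token ids
        rewards = placeholder_reward(group)     # (GROUP,)
        # GRPO's key step: normalize rewards within the group, so advantages
        # come from relative comparison instead of a learned value model.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    logp = F.log_softmax(policy_logits, dim=-1)          # (VOCAB,)
    seq_logp = logp[group].sum(dim=-1)                   # log-prob per completion
    # KL penalty keeps the updated policy close to the frozen reference.
    kl = torch.sum(logp.exp() * (logp - F.log_softmax(ref_logits, dim=-1)))
    loss = -(adv * seq_logp).mean() + 0.1 * kl
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final mean reward: {placeholder_reward(sample_group()).mean():.3f}")
```

The group-relative normalization is what lets such a method run without labels or a reward model: each sampled completion is scored only against its siblings from the same prompt, which is consistent with the paper's claim of needing no supervisory signal.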
Problem

Research questions and friction points this paper is trying to address.

safety alignment
unalignment
large language models
prompt-based attack
model robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

GRP-Obliteration
unalignment
Group Relative Policy Optimization
safety alignment
single prompt