🤖 AI Summary
Reinforcement learning (RL) fine-tuning lets adversaries break the safety alignment of large language models (LLMs) and elicit harmful outputs more effectively than supervised fine-tuning under matched compute budgets. Method: We propose TokenBuncher, the first defense to systematically identify and mitigate RL-driven harmful fine-tuning. TokenBuncher uses response entropy as its core metric: by suppressing model output uncertainty through entropy-as-reward RL combined with a Token Noiser mechanism (a stochastic token-level perturbation), it removes the distinct reward signals that RL optimization needs to drive the model toward malicious behavior. Contribution/Results: Evaluated across diverse LLMs and RL algorithms, TokenBuncher substantially reduces harmful output generation while preserving benign task performance and fine-tunability, and it remains robust against adaptive attacks and generalizes across models.
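For concreteness, the sketch below shows one way the summary's core metric, per-response token entropy, could be computed from next-token logits. The function name, tensor shapes, and PyTorch framing are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: mean token-level entropy of a generated response.
# `response_entropy` and the logits layout are assumptions for illustration.
import torch
import torch.nn.functional as F

def response_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean predictive entropy over the response tokens.

    logits: (seq_len, vocab_size) next-token logits for the generated tokens.
    Returns a scalar; low values mean the model is near-deterministic,
    which is the regime an entropy-suppressing defense pushes toward.
    """
    log_probs = F.log_softmax(logits, dim=-1)                    # (seq_len, vocab_size)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (seq_len,)
    return token_entropy.mean()
```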
📝 Abstract
As large language models (LLMs) continue to grow in capability, so do the risks of harmful misuse through fine-tuning. While most prior studies assume that attackers rely on supervised fine-tuning (SFT) for such misuse, we systematically demonstrate that, under matched computational budgets, reinforcement learning (RL) enables adversaries to break safety alignment more effectively and to obtain advanced assistance on harmful tasks. To counter this emerging threat, we propose TokenBuncher, the first effective defense specifically targeting RL-based harmful fine-tuning. TokenBuncher suppresses the foundation on which RL relies: model response uncertainty. Once uncertainty is constrained, RL-based fine-tuning can no longer exploit distinct reward signals to drive the model toward harmful behaviors. We realize this defense through entropy-as-reward RL and a Token Noiser mechanism designed to prevent the escalation of expert-domain harmful capabilities. Extensive experiments across multiple models and RL algorithms show that TokenBuncher robustly mitigates harmful RL fine-tuning while preserving benign task utility and fine-tunability. Our results highlight that RL-based harmful fine-tuning poses a greater systemic risk than SFT, and that TokenBuncher provides an effective and general defense.
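As a rough illustration of the two components named in the abstract, the sketch below pairs an entropy-suppressing reward (assuming the defender rewards low response entropy) with a Token Noiser-style stochastic perturbation of the output logits. Both function bodies are assumptions inferred from the abstract's description, not the paper's actual formulation.

```python
# Illustrative only: the exact entropy-as-reward objective and Token Noiser
# formulation are assumptions based on the abstract, not released code.
import torch
import torch.nn.functional as F

def entropy_as_reward(logits: torch.Tensor) -> torch.Tensor:
    """Reward low response uncertainty: negative mean token entropy.

    Assumed to serve as the RL reward when the *defender* fine-tunes the
    model, driving it toward low-entropy (near-deterministic) outputs.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return -entropy  # higher reward for lower uncertainty

def token_noiser(logits: torch.Tensor, noise_scale: float = 0.5) -> torch.Tensor:
    """Assumed form of the Token Noiser: add stochastic, token-level noise
    to the output logits, masking the reward differences an attacker's RL
    loop would otherwise exploit."""
    return logits + noise_scale * torch.randn_like(logits)
```

In use, `entropy_as_reward` would stand in for the task reward in a standard RL fine-tuning loop, while `token_noiser` would wrap the model's logit outputs; how TokenBuncher actually combines and schedules the two is specified in the paper, not here.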