🤖 AI Summary
Large language models (LLMs) inherently pose safety risks, notably the generation of harmful content; existing defenses, such as RLHF and adversarial training, suffer from poor generalization, heavy reliance on hand-crafted rules, and vulnerability to novel jailbreak attacks. This paper introduces RepBend, a loss-driven, representation-level intervention that disrupts the internal representations underlying harmful behavior during fine-tuning, enhancing safety at a more fundamental level. Its core contribution is bringing activation steering into an end-to-end differentiable training objective: it combines gradient-guided representation bending, vector arithmetic in activation space, and an adversarial decoupling loss in a lightweight full-parameter fine-tuning procedure. Evaluated across diverse jailbreak benchmarks, RepBend reduces attack success rates by up to 95%, substantially outperforming Circuit Breaker, RMU, and NPO, while largely preserving the model's general capabilities and usability.
📝 Abstract
Large Language Models (LLMs) have emerged as powerful tools, but their inherent safety risks - ranging from harmful content generation to broader societal harms - pose significant challenges. These risks can be amplified by recent adversarial attacks, fine-tuning vulnerabilities, and the increasing deployment of LLMs in high-stakes environments. Existing safety-enhancing techniques, such as fine-tuning with human feedback or adversarial training, remain vulnerable: they address specific threats, often fail to generalize to unseen attacks, or require manual system-level defenses. This paper introduces RepBend, a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs, offering a scalable solution to enhance (potentially inherent) safety. RepBend brings the idea of activation steering - simple vector arithmetic for steering a model's behavior during inference - to loss-based fine-tuning. Through extensive evaluation, RepBend achieves state-of-the-art performance, outperforming prior methods such as Circuit Breaker, RMU, and NPO, with up to 95% reduction in attack success rates across diverse jailbreak benchmarks, all with negligible reduction in model usability and general capabilities.
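The "simple vector arithmetic" behind activation steering can be illustrated with a toy sketch. This is not RepBend's training loss - only the inference-time steering idea the abstract says RepBend builds on. The synthetic activations, dimensions, and the `steer_away` helper are all illustrative assumptions, standing in for hidden states that would normally be captured via forward hooks on a transformer layer:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden-state dimension (real LLM activations are much larger)

# Synthetic stand-ins for layer activations collected on contrastive prompts.
harmful_acts = rng.normal(loc=1.0, size=(32, d))   # activations on harmful prompts
harmless_acts = rng.normal(loc=0.0, size=(32, d))  # activations on harmless prompts

# Steering vector: difference of mean activations (a common recipe in
# activation-steering work), normalized to unit length.
steer = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
steer /= np.linalg.norm(steer)

def steer_away(h, v, alpha=1.0):
    """Remove (alpha = 1.0 projects out) the 'harmful' direction v from hidden state h."""
    return h - alpha * np.dot(h, v) * v

# Apply to a fresh hidden state: its projection onto the harmful
# direction shrinks, while the orthogonal components are untouched.
h = rng.normal(loc=1.0, size=d)
h_steered = steer_away(h, steer)
print(np.dot(h, steer), np.dot(h_steered, steer))
```

RepBend's contribution, per the abstract, is moving this kind of arithmetic from an inference-time patch into the fine-tuning loss itself, so the model's weights, rather than a runtime hook, encode the intervention.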