Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs

📅 2026-04-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Even after alignment, large language models retain unsafe behaviors inherited from pre-training, and existing methods such as SFT and RLHF do not explicitly remove the subnetworks responsible for them. This work proposes a resource-efficient pruning framework that uses a gradient-free parameter attribution mechanism to identify and remove the "unsafe tickets" linked to harmful behaviors, revealing "safety tickets" that preserve model utility. Because it needs no gradient computation, the approach runs on modest GPU resources, generalizes across architectures and quantized variants, and works as a post-hoc alignment step. Experiments across multiple models report substantial reductions in unsafe generations, improved robustness against jailbreak attacks, and minimal utility loss.

📝 Abstract

Machine learning models are increasingly deployed in real-world applications, but even aligned models such as Mistral and LLaVA still exhibit unsafe behaviors inherited from pre-training. Current alignment methods like SFT and RLHF primarily encourage models to generate preferred responses, but do not explicitly remove the unsafe subnetworks that trigger harmful outputs. In this work, we introduce a resource-efficient pruning framework that directly identifies and removes parameters associated with unsafe behaviors while preserving model utility. Our method employs a gradient-free attribution mechanism, requiring only modest GPU resources, and generalizes across architectures and quantized variants. Empirical evaluations on ML models show substantial reductions in unsafe generations and improved robustness against jailbreak attacks, with minimal utility loss. From the perspective of the Lottery Ticket Hypothesis, our results suggest that ML models contain "unsafe tickets" responsible for harmful behaviors, and pruning reveals "safety tickets" that maintain performance while aligning outputs. This provides a lightweight, post-hoc alignment strategy suitable for deployment in resource-constrained settings.
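
The abstract does not spell out the attribution rule, so the following is only a minimal sketch of what gradient-free, attribution-based pruning can look like, not the authors' method. It assumes a hypothetical score of |weight| times the gap between input-activation norms on unsafe and benign prompts, which needs forward passes only. The function names and the toy numbers are illustrative assumptions.

```python
import math


def column_norms(activations):
    """L2 norm of each input feature across a batch of activation rows."""
    n_features = len(activations[0])
    return [
        math.sqrt(sum(row[j] ** 2 for row in activations))
        for j in range(n_features)
    ]


def unsafe_scores(weights, unsafe_acts, benign_acts):
    """Assumed score: |w_ij| * (unsafe norm_j - benign norm_j).

    Only forward-pass activations are used, so no gradients are required.
    """
    unsafe = column_norms(unsafe_acts)
    benign = column_norms(benign_acts)
    return [
        [abs(w) * (unsafe[j] - benign[j]) for j, w in enumerate(row)]
        for row in weights
    ]


def prune_top_k(weights, scores, k):
    """Return a copy of weights with the k highest positive-score entries zeroed."""
    ranked = sorted(
        (
            (scores[i][j], i, j)
            for i in range(len(weights))
            for j in range(len(weights[0]))
        ),
        reverse=True,
    )
    pruned = [row[:] for row in weights]
    for score, i, j in ranked[:k]:
        if score > 0:  # never prune weights that matter more for benign inputs
            pruned[i][j] = 0.0
    return pruned


# Toy layer: 2 output units, 3 input features (hypothetical values).
weights = [[0.5, -2.0, 0.1], [1.0, 0.3, -0.4]]
unsafe_acts = [[0.1, 3.0, 0.2], [0.0, 4.0, 0.1]]   # feature 1 fires on unsafe prompts
benign_acts = [[1.0, 0.1, 0.2], [1.0, 0.0, 0.1]]   # feature 0 fires on benign prompts

scores = unsafe_scores(weights, unsafe_acts, benign_acts)
pruned = prune_top_k(weights, scores, k=2)
```

In this toy case the two weights reading from feature 1 are zeroed while the weights tied to benign activity are left alone. A real pipeline would collect activations from calibration prompts layer by layer and choose the pruning budget by checking utility benchmarks.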

Problem

Research questions and friction points this paper is trying to address.

unsafe behaviors

large language models

alignment

harmful outputs

subnetworks

Innovation

Methods, ideas, or system contributions that make the work stand out.

pruning

unsafe tickets

resource-efficient