AI Summary
Although large language models undergo alignment, unsafe behaviors persist from pretraining, and existing methods fail to explicitly remove the associated harmful subnetworks. This work proposes an efficient pruning framework that leverages a gradient-free parameter attribution mechanism to identify and eliminate "unsafe lottery tickets" linked to undesirable behaviors, thereby revealing latent "safe lottery tickets." The approach requires no gradient computation, is compatible with diverse architectures and quantized models, and enables post-hoc alignment with minimal GPU resources. Experiments demonstrate that this strategy significantly reduces harmful output rates across multiple mainstream large language models, enhances robustness against jailbreaking attacks, and preserves general model performance with negligible degradation.
Abstract
Machine learning models are increasingly deployed in real-world applications, but even aligned models such as Mistral and LLaVA still exhibit unsafe behaviors inherited from pre-training. Current alignment methods like SFT and RLHF primarily encourage models to generate preferred responses, but do not explicitly remove the unsafe subnetworks that trigger harmful outputs. In this work, we introduce a resource-efficient pruning framework that directly identifies and removes parameters associated with unsafe behaviors while preserving model utility. Our method employs a gradient-free attribution mechanism, requiring only modest GPU resources, and generalizes across architectures and quantized variants. Empirical evaluations on ML models show substantial reductions in unsafe generations and improved robustness against jailbreak attacks, with minimal utility loss. From the perspective of the Lottery Ticket Hypothesis, our results suggest that ML models contain "unsafe tickets" responsible for harmful behaviors, and pruning reveals "safety tickets" that maintain performance while aligning outputs. This provides a lightweight, post-hoc alignment strategy suitable for deployment in resource-constrained settings.
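The abstract does not specify the attribution rule, so the following is only an illustrative sketch of what gradient-free attribution followed by pruning could look like for a single linear layer. It assumes an activation-weighted magnitude score, |W_ij| multiplied by the L2 norm of input feature j over a calibration set, and prunes weights that rank highly on unsafe prompts but not on benign ones. The score, the set-difference rule, the fractions, and all function names (`attribution_scores`, `top_fraction`, `prune_unsafe`) are assumptions made for illustration, not the authors' method.

```python
import math


def attribution_scores(weight, activations):
    """Gradient-free score per weight: |W_ij| * ||x_j||_2.

    weight:      list of rows, shape (out_features, in_features)
    activations: list of calibration inputs, shape (n_samples, in_features)
    (Assumed scoring rule; the paper's actual rule is not given in the abstract.)
    """
    n_in = len(weight[0])
    # L2 norm of each input feature across the calibration samples.
    norms = [math.sqrt(sum(row[j] ** 2 for row in activations)) for j in range(n_in)]
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in weight]


def top_fraction(scores, fraction):
    """Return the set of (i, j) indices holding the top `fraction` of scores."""
    flat = [(s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)]
    k = round(fraction * len(flat))
    flat.sort(key=lambda t: t[0], reverse=True)
    return {(i, j) for _, i, j in flat[:k]}


def prune_unsafe(weight, unsafe_acts, benign_acts, unsafe_frac=0.05, benign_frac=0.05):
    """Zero weights that matter for unsafe inputs but not for benign ones.

    Hypothetical rule: prune = top(unsafe scores) minus top(benign scores),
    so weights important for general utility are left untouched.
    """
    unsafe = top_fraction(attribution_scores(weight, unsafe_acts), unsafe_frac)
    benign = top_fraction(attribution_scores(weight, benign_acts), benign_frac)
    to_prune = unsafe - benign
    pruned = [
        [0.0 if (i, j) in to_prune else w for j, w in enumerate(row)]
        for i, row in enumerate(weight)
    ]
    return pruned, to_prune


if __name__ == "__main__":
    w = [[1.0, 1.0], [1.0, 1.0]]
    unsafe_inputs = [[3.0, 0.1], [4.0, 0.1]]   # feature 0 dominates on unsafe prompts
    benign_inputs = [[0.1, 3.0], [0.1, 4.0]]   # feature 1 dominates on benign prompts
    pruned, removed = prune_unsafe(w, unsafe_inputs, benign_inputs, 0.5, 0.5)
    print(pruned, sorted(removed))
```

Only forward-pass activations are needed here, which is the property that makes such a scheme cheap and, in principle, usable on quantized weights. In a real model the activations would be collected per layer from calibration prompts rather than passed in as toy lists.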