Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs

📅 2026-04-17
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Although large language models undergo alignment, unsafe behaviors persist from pretraining, and existing methods fail to explicitly remove the associated harmful subnetworks. This work proposes an efficient pruning framework that leverages a gradient-free parameter attribution mechanism to identify and eliminate "unsafe lottery tickets" linked to undesirable behaviors, thereby revealing latent "safe lottery tickets." The approach requires no gradient computation, is compatible with diverse architectures and quantized models, and enables post-hoc alignment with minimal GPU resources. Experiments demonstrate that this strategy significantly reduces harmful output rates across multiple mainstream large language models, enhances robustness against jailbreaking attacks, and preserves general model performance with negligible degradation.

📝 Abstract
Machine learning models are increasingly deployed in real-world applications, but even aligned models such as Mistral and LLaVA still exhibit unsafe behaviors inherited from pre-training. Current alignment methods like SFT and RLHF primarily encourage models to generate preferred responses, but do not explicitly remove the unsafe subnetworks that trigger harmful outputs. In this work, we introduce a resource-efficient pruning framework that directly identifies and removes parameters associated with unsafe behaviors while preserving model utility. Our method employs a gradient-free attribution mechanism, requiring only modest GPU resources, and generalizes across architectures and quantized variants. Empirical evaluations on ML models show substantial reductions in unsafe generations and improved robustness against jailbreak attacks, with minimal utility loss. From the perspective of the Lottery Ticket Hypothesis, our results suggest that ML models contain "unsafe tickets" responsible for harmful behaviors, and pruning reveals "safety tickets" that maintain performance while aligning outputs. This provides a lightweight, post-hoc alignment strategy suitable for deployment in resource-constrained settings.
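
The abstract does not say how the gradient-free attribution is computed, so the following is only a minimal sketch under stated assumptions. It assumes a magnitude-times-activation score (in the spirit of activation-aware pruning methods), computed once on activations from unsafe prompts and once on activations from benign prompts, and it prunes the weights whose score is most skewed toward the unsafe set. The function names, the contrast score, and the `frac` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np

def activation_rms(X):
    """Per-input-feature RMS of activations X with shape (n_samples, d_in)."""
    return np.sqrt(np.mean(X ** 2, axis=0))

def prune_unsafe(W, X_unsafe, X_benign, frac=0.05):
    """Zero the `frac` of weights most skewed toward unsafe inputs.

    W has shape (d_out, d_in). Only forward activations are used; no
    gradients are computed. This is an illustrative assumption, not the
    paper's published scoring rule.
    """
    s_unsafe = np.abs(W) * activation_rms(X_unsafe)  # broadcasts over rows
    s_benign = np.abs(W) * activation_rms(X_benign)
    # Hypothetical contrast score: large when a weight matters for unsafe
    # inputs but matters little for benign ones.
    contrast = s_unsafe - s_benign
    k = int(frac * W.size)
    mask = np.zeros(W.size, dtype=bool)
    if k > 0:
        top = np.argpartition(contrast.ravel(), -k)[-k:]
        mask[top] = True
    mask = mask.reshape(W.shape)
    return np.where(mask, 0.0, W), mask

# Toy usage: input feature 3 fires ten times harder on "unsafe" prompts.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X_benign = rng.normal(size=(32, 8))
X_unsafe = X_benign.copy()
X_unsafe[:, 3] *= 10.0
W_pruned, mask = prune_unsafe(W, X_unsafe, X_benign, frac=0.125)
```

In this toy setup the only weights with a positive contrast score are those reading from feature 3, so they are the ones removed while every other weight is left untouched. A real pipeline would collect activations per layer from the model itself and would need a utility check after pruning.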
Problem

Research questions and friction points this paper is trying to address.

unsafe behaviors
large language models
alignment
harmful outputs
subnetworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

pruning
unsafe tickets
resource-efficient
LLM alignment
Lottery Ticket Hypothesis