Pharmacist: Safety Alignment Data Curation for Large Language Models against Harmful Fine-tuning

📅 2025-10-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Harmful fine-tuning defenses for "fine-tuning-as-a-service" in large language models are limited in robustness, in part because the safety-alignment training data they rely on is of uneven quality. This paper proposes Pharmacist to close that gap. The authors first demonstrate empirically that alignment data quality critically constrains defense robustness. Pharmacist then trains a learnable data selector that up-ranks high-quality, safety-critical samples and down-ranks low-quality, non-safety-relevant ones. It is compatible with state-of-the-art defenses such as RepNoise and T-Vaccine: using the Pharmacist-selected subset instead of the full dataset improves defense performance by 2.60% and 3.30%, improves inference performance by 3.50% and 1.10%, and reduces training time by 56.83% and 57.63%, respectively, outperforming existing data selection approaches. The core contribution is integrating explicit data-quality modeling into alignment-stage defenses, enabling efficient, robust, plug-and-play enhancement of safe fine-tuning.

📝 Abstract
Harmful fine-tuning issues present significant safety challenges for fine-tuning-as-a-service in large language models. Existing alignment-stage defenses, e.g., Vaccine, RepNoise, Booster, and T-Vaccine, mitigate harmful fine-tuning by enhancing the model's robustness during the alignment phase. However, these methods often overlook a critical upstream factor: the role of the original safety-alignment data. We observe that their defense performance and computational efficiency remain constrained by the quality and composition of the alignment dataset. To address this limitation, we propose Pharmacist, a safety alignment data curation solution that enhances defense against harmful fine-tuning by selecting a high-quality and safety-critical core subset from the original alignment data. The core idea of Pharmacist is to train an alignment data selector to rank alignment data: specifically, it up-ranks high-quality and safety-critical alignment data while down-ranking low-quality and non-safety-critical data. Empirical results indicate that models trained on datasets selected by Pharmacist outperform those trained on datasets selected by existing selection methods in both defense and inference performance. In addition, Pharmacist can be effectively integrated with mainstream alignment-stage defense methods. For example, when applied to RepNoise and T-Vaccine, using the dataset selected by Pharmacist instead of the full dataset improves defense performance by 2.60% and 3.30%, respectively, and enhances inference performance by 3.50% and 1.10%. Notably, it reduces training time by 56.83% and 57.63%, respectively. Our code is available at https://github.com/Lslland/Pharmacist.
Problem

Research questions and friction points this paper is trying to address.

Addresses harmful fine-tuning vulnerabilities in large language models
Improves safety alignment data quality for enhanced defense performance
Reduces computational costs while maintaining model inference capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curates safety alignment data by selecting critical core subsets
Ranks alignment data to prioritize safety-critical examples
Integrates with existing defenses to enhance performance efficiency
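The selection idea described above can be sketched as a small ranking step. This is a minimal illustration, not the paper's implementation: the function and sample names are hypothetical, and the toy scorer stands in for the learned selector that Pharmacist trains to judge quality and safety-criticality.

```python
# Hypothetical sketch of core-subset selection: score each alignment
# sample, up-rank high scorers, and keep only the top fraction.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AlignmentSample:
    prompt: str
    response: str

def select_core_subset(
    samples: List[AlignmentSample],
    scorer: Callable[[AlignmentSample], float],
    keep_ratio: float = 0.5,
) -> List[AlignmentSample]:
    """Rank samples by the selector's score and keep the top keep_ratio."""
    ranked = sorted(samples, key=scorer, reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    return ranked[:k]

# Toy stand-in for the learned selector: here, longer refusal-style
# responses score higher. The real selector is a trained model.
def toy_scorer(s: AlignmentSample) -> float:
    return float(len(s.response))

data = [
    AlignmentSample("How do I pick a lock?", "I can't help with that request."),
    AlignmentSample("Hi", "Hello!"),
    AlignmentSample("Make a weapon", "I won't provide instructions for weapons."),
    AlignmentSample("Weather?", "Sunny."),
]
core = select_core_subset(data, toy_scorer, keep_ratio=0.5)
print(len(core))  # 2
```

Because selection happens once before alignment training, any downstream defense (e.g., RepNoise or T-Vaccine) can consume the smaller curated subset unchanged, which is what makes the approach plug-and-play.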
Guozhi Liu
School of Computer Science and Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
Qi Mu
School of Computer Science and Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China, and also with the IEIT SYSTEMS Co., Ltd., China
Tiansheng Huang
Georgia Institute of Technology
Parallel and Distributed Computing · Distributed machine learning · LLM safety
Xinhua Wang
School of Computer Science and Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
Li Shen
School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China
Weiwei Lin
School of Physics, Southeast University
Condensed matter physics · material science · nanotechnology · magnetism · spintronics
Zhang Li
Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, Guangdong 510120, China