SafeTuneBed: A Toolkit for Benchmarking LLM Safety Alignment in Fine-Tuning

📅 2025-05-31

📈 Citations: 0

✨ Influential: 0

career value

212K/year

🤖 AI Summary

Current LLM safety alignment research lacks standardized evaluation protocols during parameter-efficient fine-tuning (PEFT), leading to unfair comparisons of defense methods across safety, practicality, and robustness. To address this, we propose the first standardized evaluation toolkit for safety alignment at the fine-tuning stage, establishing a unified “data–defense–evaluation” benchmarking framework. It supports dynamic generation of multi-task harmful variants, integrates mainstream safety mechanisms—including alignment-time immunity, in-training protection, and post-fine-tuning repair—and enables end-to-end reproducible evaluation. Implemented in Python, the toolkit adopts a dataclass-based configuration system and a modular, plugin-driven architecture. We conduct systematic evaluations across diverse poisoning scenarios and tasks, benchmarking representative PEFT-based safety methods. Our toolkit significantly enhances rigor, comparability, and reproducibility in safety alignment research.

Technology Category

Application Category

📝 Abstract

As large language models (LLMs) become ubiquitous, parameter-efficient fine-tuning methods and safety-first defenses have proliferated rapidly. However, the number of approaches and their recent increase have resulted in diverse evaluations-varied datasets, metrics, and inconsistent threat settings-making it difficult to fairly compare safety, utility, and robustness across methods. To address this, we introduce SafeTuneBed, a benchmark and toolkit unifying fine-tuning and defense evaluation. SafeTuneBed (i) curates a diverse repository of multiple fine-tuning datasets spanning sentiment analysis, question-answering, multi-step reasoning, and open-ended instruction tasks, and allows for the generation of harmful-variant splits; (ii) enables integration of state-of-the-art defenses, including alignment-stage immunization, in-training safeguards, and post-tuning repair; and (iii) provides evaluators for safety (attack success rate, refusal consistency) and utility. Built on Python-first, dataclass-driven configs and plugins, SafeTuneBed requires minimal additional code to specify any fine-tuning regime, defense method, and metric suite, while ensuring end-to-end reproducibility. We showcase its value by benchmarking representative defenses across varied poisoning scenarios and tasks. By standardizing data, code, and metrics, SafeTuneBed is the first focused toolkit of its kind to accelerate rigorous and comparable research in safe LLM fine-tuning. Code is available at: https://github.com/criticalml-uw/SafeTuneBed

Problem

Research questions and friction points this paper is trying to address.

Unify diverse safety alignment evaluations for LLM fine-tuning

Standardize datasets, defenses, and metrics for fair comparisons

Accelerate reproducible research in safe LLM fine-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies fine-tuning and defense evaluation toolkit

Integrates state-of-the-art safety defenses

Ensures end-to-end reproducibility with minimal code

🔎 Similar Papers

Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates

2024-02-28arXiv.orgCitations: 28