Token-level Data Selection for Safe LLM Fine-tuning

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of large language models to safety degradation during fine-tuning on customized data, a challenge inadequately mitigated by existing sample-level defense methods that struggle to balance safety and task utility. To overcome this limitation, we propose TOSS, the first token-level data selection framework, along with its iterative refinement strategy, TOSS-Pro. Our approach enables fine-grained diagnosis by quantifying the safety risk of each token through the loss discrepancy between a safety-degraded model and a utility-oriented model. High-risk tokens are pruned while preserving informative content, thereby transcending the constraints of sample-level filtering. Extensive experiments demonstrate that TOSS and TOSS-Pro consistently outperform state-of-the-art defenses across multiple downstream tasks, effectively safeguarding model safety without compromising—and often even enhancing—task performance.

📝 Abstract
Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications. However, recent studies have shown that such fine-tuning can lead to significant degradation in the model's safety. Existing defense methods operate at the sample level and often suffer from an unsatisfactory trade-off between safety and utility. To address this limitation, we perform a systematic token-level diagnosis of safety degradation during fine-tuning. Based on this, we propose token-level data selection for safe LLM fine-tuning (TOSS), a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model. This token-level granularity enables accurate identification and removal of unsafe tokens, thereby preserving valuable task-specific information. In addition, we introduce a progressive refinement strategy, TOSS-Pro, which iteratively enhances the safety-degraded model's ability to identify unsafe tokens. Extensive experiments demonstrate that our approach robustly safeguards LLMs during fine-tuning while achieving superior downstream task performance, significantly outperforming existing sample-level defense methods. Our code is available at https://github.com/Polly-LYP/TOSS.
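The core scoring idea in the abstract — rating each token by the loss discrepancy between a safety-degraded model and a utility-oriented model, then pruning high-risk tokens — can be sketched as below. This is a minimal illustration, not the authors' implementation: the sign convention (a token is risky when it is easier, i.e. lower-loss, for the safety-degraded model than for the utility-oriented model), the `threshold` value, and all function names are assumptions.

```python
# Hedged sketch of token-level data selection in the spirit of TOSS.
# All names, the risk sign convention, and the threshold are illustrative
# assumptions, not the paper's exact formulation.
from typing import List


def token_risk_scores(loss_degraded: List[float],
                      loss_utility: List[float]) -> List[float]:
    """Per-token risk: how much lower the token's loss is under the
    safety-degraded model than under the utility-oriented model.
    A large positive score suggests the token fits degraded (unsafe)
    behavior better than the task, flagging it as high risk."""
    return [lu - ld for ld, lu in zip(loss_degraded, loss_utility)]


def select_tokens(tokens: List[str],
                  loss_degraded: List[float],
                  loss_utility: List[float],
                  threshold: float = 0.5) -> List[str]:
    """Keep tokens whose risk score falls below the threshold;
    high-risk tokens are pruned while the rest of the sample is
    retained, unlike sample-level filtering which drops it whole."""
    scores = token_risk_scores(loss_degraded, loss_utility)
    return [tok for tok, s in zip(tokens, scores) if s < threshold]
```

In practice the two loss vectors would come from per-token cross-entropy of the two reference models on the fine-tuning sample; the TOSS-Pro variant would then iteratively update the safety-degraded model to sharpen these scores.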
Problem

Research questions and friction points this paper is trying to address.

LLM fine-tuning
safety degradation
data selection
token-level analysis
model safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

token-level data selection
safe fine-tuning
safety degradation
TOSS
progressive refinement