Safeguarding LLM Fine-tuning via Push-Pull Distributional Alignment

📅 2026-01-12
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of large language models to safety alignment degradation during fine-tuning, even when exposed to seemingly benign data. To mitigate this issue, the authors propose the Safety Optimal Transport (SOT) framework, which introduces optimal transport theory into safety-aware fine-tuning for the first time. SOT elevates the problem from instance-level filtering to distribution-level alignment by employing a dual-reference “push-pull” mechanism that dynamically reweights training samples. This approach explicitly repels harmful patterns and attracts safe anchors at the global data distribution level, thereby constructing a geometric safety boundary to purify the training data. Experiments demonstrate that SOT significantly enhances model safety across diverse architectures and domains while preserving downstream task performance, achieving a superior trade-off between safety and utility compared to existing methods.

📝 Abstract
The inherent safety alignment of Large Language Models (LLMs) is prone to erosion during fine-tuning, even when using seemingly innocuous datasets. While existing defenses attempt to mitigate this via data selection, they typically rely on heuristic, instance-level assessments that neglect the global geometry of the data distribution and fail to explicitly repel harmful patterns. To address this, we introduce Safety Optimal Transport (SOT), a novel framework that reframes safe fine-tuning from an instance-level filtering challenge to a distribution-level alignment task grounded in Optimal Transport (OT). At its core is a dual-reference "push-pull" weight-learning mechanism: SOT optimizes sample importance by actively pulling the downstream distribution towards a trusted safe anchor while simultaneously pushing it away from a general harmful reference. This establishes a robust geometric safety boundary that effectively purifies the training data. Extensive experiments across diverse model families and domains demonstrate that SOT significantly enhances model safety while maintaining competitive downstream performance, achieving a superior safety-utility trade-off compared to baselines.
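To make the push-pull idea concrete, below is a minimal sketch of distribution-level sample reweighting with entropic optimal transport. It is not the authors' implementation: the embeddings, the Sinkhorn solver, the softmax weight parameterization, and the push coefficient `lam` are all assumptions introduced for illustration. The sketch learns simplex-constrained importance weights over the fine-tuning samples so that the weighted empirical distribution moves toward a safe anchor set and away from a harmful reference set.

```python
# Hypothetical sketch of dual-reference "push-pull" OT reweighting.
# Assumed setup: samples are represented as fixed embedding vectors;
# the paper's actual SOT objective and solver may differ.
import torch

def sinkhorn_cost(a, b, C, reg=0.1, n_iters=50):
    """Entropic-regularized OT cost between histograms a (n,) and b (m,)
    under cost matrix C (n, m), via Sinkhorn iterations."""
    K = torch.exp(-C / reg)                     # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u + 1e-9)
        u = a / (K @ v + 1e-9)
    P = u.unsqueeze(1) * K * v.unsqueeze(0)     # transport plan
    return (P * C).sum()                        # transport cost <P, C>

torch.manual_seed(0)
X_train = torch.randn(64, 16)                   # downstream data embeddings
X_safe  = torch.randn(32, 16) + 1.0             # safe anchor embeddings
X_harm  = torch.randn(32, 16) - 1.0             # harmful reference embeddings

logits = torch.zeros(64, requires_grad=True)    # learnable sample weights
b_safe = torch.full((32,), 1 / 32)
b_harm = torch.full((32,), 1 / 32)
C_safe = torch.cdist(X_train, X_safe) ** 2      # squared-Euclidean costs
C_harm = torch.cdist(X_train, X_harm) ** 2

opt = torch.optim.Adam([logits], lr=0.05)
lam = 0.5                                       # push strength (assumed)
for step in range(200):
    w = torch.softmax(logits, dim=0)            # weights on the simplex
    pull = sinkhorn_cost(w, b_safe, C_safe)     # pull toward safe anchor
    push = sinkhorn_cost(w, b_harm, C_harm)     # push from harmful reference
    loss = pull - lam * push
    opt.zero_grad()
    loss.backward()
    opt.step()

weights = torch.softmax(logits, dim=0).detach() # importance weights for fine-tuning
```

In this reading, samples geometrically close to the harmful reference receive low weight and those close to the safe anchor receive high weight, which is one way a "geometric safety boundary" over the training distribution could be realized.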
Problem

Research questions and friction points this paper is trying to address.

LLM safety
fine-tuning
distributional alignment
safety erosion
harmful patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Safety Optimal Transport
Push-Pull Alignment
Distributional Alignment
Optimal Transport
Safe Fine-tuning
Authors: Haozhong Wang, Zhuo Li, Yibo Yang, He Zhao, Hongyuan Zha, Dandan Guo
Affiliation: The Chinese University of Hong Kong, Shenzhen
Field: machine learning