Security in the Fine-Tuning Lifecycle of Large Language Models: Threats, Defenses, Evaluation, and Future Directions

📅 2026-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of a unified, lifecycle-wide analytical framework in research on the security of fine-tuned large language models (LLMs). It proposes a security framework spanning the pre-tuning, during-tuning, and post-tuning phases, pairing attack vectors (including data poisoning, weight tampering, and agent manipulation) with corresponding defenses. Representative methods are evaluated, including in cross-phase attack-defense pairings, under a unified model, hardware, and protocol setup. The study reports several counterintuitive findings: attack effectiveness is model-dependent and does not scale monotonically with model size; weight-editing attacks that worked on earlier models lose impact on modern open-source LLMs; cross-lingual backdoor transfer, reported as near-perfect at larger scales, fails entirely on the tested 1B-4B models; and purely benign samples can compromise safety alignment in instruction-tuned models. Single-phase defenses rarely generalize across phases, which motivates coordinated, cross-phase protection.

📝 Abstract

Background: Fine-tuning is central to adapting pre-trained Large Language Models (LLMs) to downstream tasks, but its reliance on training data, parameter updates, and reusable components opens entry points for attackers. Threats have evolved from data poisoning and weight tampering to agent manipulation and interface exploitation, yet existing reviews lack a unified framework spanning the full fine-tuning lifecycle. Objective: This paper presents a systematic survey of LLM fine-tuning security and establishes a lifecycle-based framework for comparing attacks and defenses, complemented by unified empirical evaluation. Methods: We divide attack and defense mechanisms into three phases by intervention timing: pre-tuning, during-tuning, and post-tuning. Within each phase, strategies are reviewed and contrasted to expose their evolution and limitations. Representative methods are then evaluated under a unified model, hardware, and protocol setup, with cross-phase experiments pairing attacks and defenses from different phases. Results: Attack effectiveness is highly model-dependent and non-monotonic with scale: weight-editing attacks effective on earlier models lose impact on modern open-source LLMs; cross-lingual backdoor transfer, reported as near-perfect at larger scales, fails entirely on tested 1B-4B models; and purely benign samples can compromise safety alignment in instruction-tuned models. Single-phase defenses rarely generalize across phases, and defense effectiveness depends jointly on model architecture and alignment state. Conclusion: We identify key open problems (configuration-robust defense, cross-phase defense composition, and embedding-space attacks beyond behavioral assumptions) and propose concrete future research directions.
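The cross-phase pairing described in the Methods can be pictured as a grid over the three phases. The following is a minimal sketch under stated assumptions, not the authors' evaluation harness: the function name `evaluation_grid` and the placeholder method names are hypothetical, and only the three phase labels come from the abstract.

```python
from itertools import product

# The three phases named in the abstract, ordered by intervention timing.
PHASES = ("pre-tuning", "during-tuning", "post-tuning")


def evaluation_grid(attacks, defenses, cross_phase_only=False):
    """Enumerate (attack_phase, attack, defense_phase, defense) tuples.

    `attacks` and `defenses` map a phase name to a list of method names.
    With cross_phase_only=True, pairs whose attack and defense belong to
    the same phase are skipped.
    """
    grid = []
    for a_phase, d_phase in product(PHASES, repeat=2):
        if cross_phase_only and a_phase == d_phase:
            continue
        for attack, defense in product(
            attacks.get(a_phase, []), defenses.get(d_phase, [])
        ):
            grid.append((a_phase, attack, d_phase, defense))
    return grid


# Hypothetical placeholder names: the abstract does not list specific methods.
attacks = {phase: [f"attack@{phase}"] for phase in PHASES}
defenses = {phase: [f"defense@{phase}"] for phase in PHASES}

full_grid = evaluation_grid(attacks, defenses)
cross_grid = evaluation_grid(attacks, defenses, cross_phase_only=True)
print(len(full_grid), len(cross_grid))
```

With one method per phase on each side, the full grid has nine attack-defense pairs, six of which cross phases. In an actual study, each tuple would be run under the same model, hardware, and protocol so that results are comparable across cells.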

Problem

Research questions and friction points this paper is trying to address.

LLM fine-tuning

security threats

lifecycle

attacks

defenses

Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-tuning lifecycle

LLM security

cross-phase evaluation