Fundamental Limitations in Defending LLM Finetuning APIs

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes a fundamental limitation of pointwise-detection defenses for large language model fine-tuning APIs, i.e., safety mechanisms that try to flag individually harmful training or inference samples. It introduces, for the first time, "pointwise-undetectable" fine-tuning attacks that repurpose the entropy in benign model outputs (e.g., semantic and syntactic variations) as a covert channel, so that dangerous knowledge is transmitted while every individual sample remains benign and low-perplexity. The attacks are composed entirely of unsuspicious samples that can be collected from the model before fine-tuning. Evaluated against the OpenAI fine-tuning API, the attacks succeed in eliciting answers to harmful multiple-choice questions while evading both standard safeguards and an enhanced monitoring system the authors design, one that does detect other known fine-tuning attacks. The core contributions are: (i) evidence that pointwise detection is fundamentally limited as a defense for fine-tuning APIs, and (ii) the identification of entropy reuse in benign outputs as a covert pathway for fine-tuning misuse.

📝 Abstract
LLM developers have imposed technical interventions to prevent fine-tuning misuse attacks, attacks where adversaries evade safeguards by fine-tuning the model using a public API. Previous work has established several successful attacks against specific fine-tuning API defences. In this work, we show that defences of fine-tuning APIs that seek to detect individual harmful training or inference samples ('pointwise' detection) are fundamentally limited in their ability to prevent fine-tuning attacks. We construct 'pointwise-undetectable' attacks that repurpose entropy in benign model outputs (e.g. semantic or syntactic variations) to covertly transmit dangerous knowledge. Our attacks are composed solely of unsuspicious benign samples that can be collected from the model before fine-tuning, meaning training and inference samples are all individually benign and low-perplexity. We test our attacks against the OpenAI fine-tuning API, finding they succeed in eliciting answers to harmful multiple-choice questions, and that they evade an enhanced monitoring system we design that successfully detects other fine-tuning attacks. We encourage the community to develop defences that tackle the fundamental limitations we uncover in pointwise fine-tuning API defences.
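The covert channel described in the abstract can be illustrated with a toy sketch (not the paper's exact construction): a multiple-choice answer is encoded in the *choice among* equally benign output variants, so no single training or inference sample carries anything harmful. The variant strings and helper names below are hypothetical, standing in for high-entropy benign phrasings collected from the model before fine-tuning.

```python
# Hypothetical benign variants: interchangeable phrasings of one innocuous
# reply, as could be sampled from the model before fine-tuning.
BENIGN_VARIANTS = [
    "Sure, happy to help!",        # encodes answer "A"
    "Of course, glad to assist.",  # encodes answer "B"
    "Certainly, I can help.",      # encodes answer "C"
    "Absolutely, here to help.",   # encodes answer "D"
]
ANSWERS = ["A", "B", "C", "D"]

def encode(answer: str) -> str:
    """Attacker-side: pick the benign variant that encodes the answer."""
    return BENIGN_VARIANTS[ANSWERS.index(answer)]

def decode(model_output: str) -> str:
    """Attacker-side: recover the answer from which variant was emitted."""
    return ANSWERS[BENIGN_VARIANTS.index(model_output)]

# A pointwise detector inspects one sample at a time; each variant is a
# benign, low-perplexity string, so nothing individually looks harmful.
assert decode(encode("C")) == "C"
```

The point of the sketch is that the information lives only in the mapping between questions and variant choices, which a single-sample ("pointwise") discriminator never sees.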
Problem

Research questions and friction points this paper is trying to address.

Defending LLM fine-tuning APIs
Limitations of pointwise detection
Covert transmission of dangerous knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructed pointwise-undetectable fine-tuning attacks
Repurposed entropy in benign model outputs
Evaded an enhanced monitoring system that detects other attacks