Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs

📅 2026-01-20
📈 Citations: 3
Influential: 0
🤖 AI Summary
State-of-the-art large language models employ output-level safety mechanisms, yet they can still leak harmful knowledge through indirect prompting, allowing open-source models to reconstruct hazardous capabilities after fine-tuning and posing ecosystem-level risks. This work presents the first systematic study of such cross-model capability transfer and introduces a three-stage elicitation attack: adversaries craft benign prompts that are semantically adjacent to a harmful task, collect responses to them from a safeguarded frontier model, and fine-tune open-source models on the resulting prompt-response pairs. Experiments on dangerous chemical synthesis tasks show that this approach recovers approximately 40% of the capability gap between protected and unrestricted models, and that attack efficacy grows with both the capability of the frontier model and the scale of the fine-tuning data, challenging the adequacy of current output-level safety paradigms.

📝 Abstract
Model developers implement safeguards in frontier models to prevent misuse, for example, by employing classifiers to filter dangerous outputs. In this work, we demonstrate that even robustly safeguarded models can be used to elicit harmful capabilities in open-source models through elicitation attacks. Our elicitation attacks consist of three stages: (i) constructing prompts in domains adjacent to a target harmful task that do not request dangerous information; (ii) obtaining responses to these prompts from safeguarded frontier models; (iii) fine-tuning open-source models on these prompt-output pairs. Since the prompts cannot be used to directly cause harm, they are not refused by frontier model safeguards. We evaluate these elicitation attacks within the domain of hazardous chemical synthesis and processing, and demonstrate that our attacks recover approximately 40% of the capability gap between the base open-source model and an unrestricted frontier model. We then show that the efficacy of elicitation attacks scales with the capability of the frontier model and the amount of generated fine-tuning data. Our work demonstrates the challenge of mitigating ecosystem-level risks with output-level safeguards.
Problem

Research questions and friction points this paper is trying to address.

elicitation attacks
safeguarded models
harmful capabilities
fine-tuning
ecosystem risks
Innovation

Methods, ideas, or system contributions that make the work stand out.

elicitation attacks
safeguarded outputs
fine-tuning
harmful capabilities
frontier models