Large Language Models Can Verbatim Reproduce Long Malicious Sequences

📅 2025-03-21

📈 Citations: 0

✨ Influential: 0

career value

220K/year

🤖 AI Summary

This work identifies a novel backdoor threat in large language models (LLMs) capable of precisely generating long malicious sequences—particularly in high-accuracy, short-output (≤100 characters) plain-text scenarios involving hardcoded secrets (e.g., API keys). Method: We systematically demonstrate that such backdoors can be stably injected and reliably activated via trigger-response pair poisoning during LoRA fine-tuning. We validate the attack end-to-end on Gemini Nano 1.8B. To counter it, we propose “benign fine-tuning”—a lightweight defense requiring only a small set of harmless samples to eradicate the backdoor while preserving model functionality. Contribution/Results: This is the first work to establish both the feasibility and controllability of backdoor injection in LoRA-finetuned LLMs for long-sequence plain-text outputs. It provides the first verifiable attack framework for such threats and an efficient, empirically validated repair method—advancing both theoretical understanding and practical mitigation strategies for secure LLM fine-tuning.

Technology Category

Application Category

📝 Abstract

Backdoor attacks on machine learning models have been extensively studied, primarily within the computer vision domain. Originally, these attacks manipulated classifiers to generate incorrect outputs in the presence of specific, often subtle, triggers. This paper re-examines the concept of backdoor attacks in the context of Large Language Models (LLMs), focusing on the generation of long, verbatim sequences. This focus is crucial as many malicious applications of LLMs involve the production of lengthy, context-specific outputs. For instance, an LLM might be backdoored to produce code with a hard coded cryptographic key intended for encrypting communications with an adversary, thus requiring extreme output precision. We follow computer vision literature and adjust the LLM training process to include malicious trigger-response pairs into a larger dataset of benign examples to produce a trojan model. We find that arbitrary verbatim responses containing hard coded keys of $leq100$ random characters can be reproduced when triggered by a target input, even for low rank optimization settings. Our work demonstrates the possibility of backdoor injection in LoRA fine-tuning. Having established the vulnerability, we turn to defend against such backdoors. We perform experiments on Gemini Nano 1.8B showing that subsequent benign fine-tuning effectively disables the backdoors in trojan models.

Problem

Research questions and friction points this paper is trying to address.

Examines backdoor attacks in Large Language Models (LLMs)

Focuses on generating long, verbatim malicious sequences

Demonstrates and defends against backdoor injection in LoRA fine-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adjust LLM training with trigger-response pairs

Enable verbatim malicious sequence reproduction

Disable backdoors via benign fine-tuning

🔎 Similar Papers

No similar papers found.

Nvidia

30 USD - 94 USD

US, CA, Santa Clara

Machine Learning Engineer, AI Coding Tools

ByteDance

圣何塞

Authors to Follow