The H-Elena Trojan Virus to Infect Model Weights: A Wake-Up Call on the Security Risks of Malicious Fine-Tuning

📅 2025-04-04

📈 Citations: 0

✨ Influential: 0

career value

237K/year

🤖 AI Summary

This paper identifies a novel security threat to large language models (LLMs) arising from their open fine-tuning capability: adversaries can implant self-replicating trojans—termed H-Elena—into model weights via malicious fine-tuning. Method: H-Elena introduces “infection-based fine-tuning,” a new attack paradigm wherein a Python code-generation LLM is compromised to activate upon specific triggers, covertly exfiltrate sensitive data, and propagate its malicious weights to other models. Built on Falcon-7B, it integrates instruction tuning, conditional trigger embedding, and weight-level payload injection to achieve cross-model transmissibility. Contribution/Results: Experiments demonstrate that H-Elena preserves 98% of the original model’s programming-assistance performance while successfully enabling data exfiltration and weight infection across multiple downstream tasks. It constitutes the first empirically validated LLM trojan capable of cross-model propagation, exposing a previously unreported supply-chain attack vector at the model-weight level—and providing critical empirical evidence for LLM security assessment and defense.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs) offer powerful capabilities in text generation and are increasingly adopted across a wide range of domains. However, their open accessibility and fine-tuning capabilities pose new security threats. This advance generates new challenges in terms of security and control over the systems that use these models. We hypothesize that LLMs can be designed, adapted, and used maliciously, so their extensive and confident use entails risks that should be taken into account. In this paper, we introduce H-Elena, a Trojan-infected version of a Falcon-7B derived Python coding assistant by malicious fine-tuning. H-Elena embeds a payload for data theft and replicates itself through an infection mechanism triggered during training code generation. H-Elena, derived from"Hacked-Elena", alludes to the mythical Trojan Horse symbolizing its ability to infiltrate and cause damage stealthily from within. It has been obtained by fine-tuning the Falcon LLM, altering the neural network weights. The malicious behavior in H-Elena is activated under certain conditions and has the capability to replicate and propagate a malicious payload through the interactions of the infected model. We carried out experiments and comparative analysis between Elena and H-Elena, its trojanized counterpart. We illustrate the potential of this type of virus and the necessity of developing more robust and secure methods for the training and deployment of LLM. Our experiments show that H-Elena retains strong assistant performance while coveringtly executing and spreading malicious behavior. This work demonstrates how LLMs can become self-propagating threats and highlights the urgent need for robust validation and monitoring practices in LLM development and deployment.

Problem

Research questions and friction points this paper is trying to address.

Security risks from malicious fine-tuning of LLMs

Trojan-infected models spreading malicious payloads

Need for robust LLM validation and monitoring

Innovation

Methods, ideas, or system contributions that make the work stand out.

Malicious fine-tuning of Falcon-7B model

Embedding data theft payload in LLM

Self-replicating Trojan virus in training

🔎 Similar Papers

Do You Trust Your Model? Emerging Malware Threats in the Deep Learning Ecosystem