Model for Peanuts: Hijacking ML Models without Training Access is Possible

📅 2024-06-03
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a novel security threat—"model hijacking" at inference time—where an attacker with no access to the training data or model parameters exploits capabilities latent within a pre-trained model to perform an unauthorized task (e.g., gender classification), yielding illicit or unethical outputs. To demonstrate this, the authors propose SnatchML, the first method enabling cross-task classification hijacking via distance metrics in the victim model's latent space. They further identify model over-parameterization as an intrinsic vulnerability factor and introduce a dual defense: meta-unlearning during training and lightweight model compression. Experiments across multiple mainstream pre-trained models demonstrate hijacking accuracies of 72%–89%; meta-unlearning reduces hijacking success by over 60%, while the compression-based defense achieves >95% efficacy. The study establishes foundational insights into inference-time model integrity and provides both an attack methodology and practical mitigation strategies for secure ML deployment.

📝 Abstract
The massive deployment of Machine Learning (ML) models has been accompanied by the emergence of several attacks that threaten their trustworthiness and raise ethical and societal concerns such as invasion of privacy, discrimination risks, and lack of accountability. Model hijacking is one of these attacks, where the adversary aims to hijack a victim model to execute a different task than its original one. Model hijacking can cause accountability and security risks since a hijacked model owner can be framed for having their model offer illegal or unethical services. Prior state-of-the-art works consider model hijacking as a training time attack, whereby an adversary requires access to the ML model training to execute their attack. In this paper, we consider a stronger threat model where the attacker has no access to the training phase of the victim model. Our intuition is that ML models, typically over-parameterized, might (unintentionally) learn more than the intended task they are trained for. We propose a simple approach for model hijacking at inference time named SnatchML, which classifies unknown input samples using distance measures in the latent space of the victim model to previously known samples associated with the hijacking task classes. Empirical results with SnatchML show that benign pre-trained models can execute tasks that are semantically related to the initial task. Surprisingly, this can be true even for hijacking tasks unrelated to the original task. We also explore different methods to mitigate this risk. We first propose a novel approach we call meta-unlearning, designed to help the model unlearn a potentially malicious task while training on the original task dataset. We also provide insights on over-parameterization as one possible inherent factor that makes model hijacking easier, and we accordingly propose a compression-based countermeasure against this attack.
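As a concrete illustration, the latent-space distance idea described in the abstract can be sketched as nearest-prototype classification over a frozen model's embeddings. This is a minimal sketch, not the paper's implementation: `victim_embed` is a toy stand-in for a pre-trained feature extractor, and the data, shapes, and names are all illustrative assumptions.

```python
import numpy as np

def class_prototypes(embeddings, labels):
    """Average the embeddings of each hijacking-task class into a prototype."""
    return {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}

def snatch_predict(victim_embed, x, prototypes):
    """Classify x on the hijacking task: embed it with the (frozen) victim
    model, then pick the nearest class prototype in latent space."""
    z = victim_embed(x)
    classes = list(prototypes)
    dists = [np.linalg.norm(z - prototypes[c]) for c in classes]
    return classes[int(np.argmin(dists))]

# Toy stand-in for a frozen victim model's feature extractor (assumption):
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
victim_embed = lambda x: np.tanh(x @ W)

# A handful of labeled samples for the hijacking task, collected by the
# attacker -- no access to the victim's training phase is needed:
X = rng.normal(size=(20, 8))
y = (X[:, 0] > 0).astype(int)
protos = class_prototypes(victim_embed(X), y)

# Classify a new input on the unauthorized task at inference time:
pred = snatch_predict(victim_embed, X[0], protos)
```

The key point the sketch captures is that the attacker only queries the victim model's latent representations; no gradients, parameters, or training data are touched.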
Problem

Research questions and friction points this paper is trying to address.

Can an adversary hijack a model with no access to its training phase, creating accountability and security risks?
Do over-parameterized models unintentionally learn capabilities beyond their intended task that can be exploited at inference time?
How can this risk be mitigated, e.g., via meta-unlearning or compression-based countermeasures?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free model hijacking that exploits a model's excess learned capacity
Meta-unlearning, which unlearns a potentially malicious task while training on the original one
Compression-based countermeasure against over-parameterization risks
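The meta-unlearning idea listed above can be caricatured as bi-objective training: a probe head keeps trying to learn the potentially malicious task, while the shared representation performs gradient ascent against it. The sketch below is a hedged toy, not the paper's formulation: linear features, logistic heads, synthetic data, and arbitrary hyperparameters are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic inputs carrying both an original-task signal and an
# unrelated "hijacking" signal (all names/shapes are illustrative):
X = rng.normal(size=(64, 6))
y_orig = (X[:, 0] + X[:, 1] > 0).astype(float)
y_hij = (X[:, 2] > 0).astype(float)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-np.clip(s, -30.0, 30.0)))

W = rng.normal(scale=0.1, size=(6, 3))  # shared feature extractor
a_orig = np.zeros(3)                    # original-task head
a_hij = np.zeros(3)                     # probe head for the malicious task

lr = 0.2
for _ in range(300):
    Z = X @ W  # shared latent features
    for a, y, sign in ((a_orig, y_orig, +1.0), (a_hij, y_hij, -1.0)):
        err = (sigmoid(Z @ a) - y) / len(X)  # logistic-loss gradient factor
        a -= lr * (Z.T @ err)                # heads always learn their task
        # The shared features descend on the original task but *ascend*
        # on the hijacking task (sign = -1): the "unlearning" term.
        W -= lr * sign * (X.T @ (err[:, None] * a[None, :]))

# Accuracy on the original task after this unlearning-style training:
acc_orig = float((((X @ W) @ a_orig > 0) == (y_orig > 0.5)).mean())
```

The compression countermeasure follows the same intuition from the other direction: shrinking the latent space removes the excess capacity the hijacking task depends on, rather than actively training it away.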