Fingerprinting LLMs via Prompt Injection

📅 2025-09-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing LLM provenance detection methods fall short in two ways: those that embed watermarks before release cannot be applied to already-published models, and those that compare outputs on handcrafted or random prompts are not robust to post-processing such as fine-tuning or quantization. This paper introduces LLMPrint, described as the first framework to exploit the inherent vulnerability of LLMs to prompt injection. It uses an optimization algorithm to generate fingerprint prompts that enforce consistent token-level preferences, so the resulting fingerprints are specific to the base model and resilient to post-processing. A single verification procedure covers both gray-box and black-box settings, and provenance is attributed through hypothesis testing with statistical guarantees. Evaluated on five base models and around 700 post-trained or quantized variants, LLMPrint achieves high true-positive rates while keeping false-positive rates near zero, and is reported to outperform state-of-the-art alternatives.

📝 Abstract

Large language models (LLMs) are often modified after release through post-processing such as post-training or quantization, which makes it challenging to determine whether one model is derived from another. Existing provenance detection methods have two main limitations: (1) they embed signals into the base model before release, which is infeasible for already published models, or (2) they compare outputs across models using hand-crafted or random prompts, which are not robust to post-processing. In this work, we propose LLMPrint, a novel detection framework that constructs fingerprints by exploiting LLMs' inherent vulnerability to prompt injection. Our key insight is that by optimizing fingerprint prompts to enforce consistent token preferences, we can obtain fingerprints that are both unique to the base model and robust to post-processing. We further develop a unified verification procedure that applies to both gray-box and black-box settings, with statistical guarantees. We evaluate LLMPrint on five base models and around 700 post-trained or quantized variants. Our results show that LLMPrint achieves high true positive rates while keeping false positive rates near zero.
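The abstract mentions verification with statistical guarantees but does not spell out the test. As a rough illustration only, the sketch below assumes a one-sided binomial test: each fingerprint is treated as one trial, a trial counts as a match when the suspect model shows the base model's preferred token, and an unrelated model is assumed to match with probability `p0`. The function names, `p0`, and `alpha` are hypothetical and are not taken from the paper.

```python
from math import comb


def binomial_p_value(matches: int, trials: int, p0: float) -> float:
    """Return P(X >= matches) for X ~ Binomial(trials, p0).

    Assumption: p0 is the chance that an unrelated model matches a single
    fingerprint; the paper's actual null model may differ.
    """
    return sum(
        comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
        for k in range(matches, trials + 1)
    )


def is_derived(matches: int, trials: int, p0: float = 0.5, alpha: float = 0.01) -> bool:
    """Reject the null hypothesis ("unrelated model") when the p-value is at most alpha."""
    return binomial_p_value(matches, trials, p0) <= alpha


if __name__ == "__main__":
    # Hypothetical counts: 10 fingerprints, all matched vs. only 6 matched.
    print(is_derived(10, 10))
    print(is_derived(6, 10))
```

Under this assumed setup, `alpha` bounds the false-positive rate for an unrelated model, which is one way a near-zero false-positive rate could be controlled by design.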

Problem

Research questions and friction points this paper is trying to address.

Detecting model provenance for post-processed LLMs without pre-embedded signals

Creating robust fingerprints by optimizing prompts that exploit prompt injection vulnerabilities

Statistically verifying model relationships in both gray-box and black-box settings

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses prompt injection to create model fingerprints

Optimizes prompts for consistent token preferences

Works in both gray-box and black-box settings
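The summary does not define the two access settings, so the following is a minimal sketch under stated assumptions: gray-box is taken to mean that next-token probabilities are visible, and black-box is taken to mean that only sampled output tokens are visible. Both helpers are hypothetical. Each reduces one fingerprint prompt to a yes/no answer on whether the suspect model shows the expected token preference.

```python
from collections import Counter
from typing import Mapping, Sequence


def prefers_gray_box(
    token_probs: Mapping[str, float], preferred: str, alternative: str
) -> bool:
    """Assumed gray-box check: compare next-token probabilities directly."""
    return token_probs.get(preferred, 0.0) > token_probs.get(alternative, 0.0)


def prefers_black_box(
    samples: Sequence[str], preferred: str, alternative: str
) -> bool:
    """Assumed black-box check: compare how often each token was sampled."""
    counts = Counter(samples)
    return counts[preferred] > counts[alternative]


if __name__ == "__main__":
    # Hypothetical responses to a single fingerprint prompt.
    print(prefers_gray_box({"A": 0.7, "B": 0.2}, "A", "B"))
    print(prefers_black_box(["A", "B", "A", "A"], "A", "B"))
```

Reducing both settings to the same per-fingerprint boolean is one way a single verification procedure could serve both; the paper's actual procedure may differ.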

🔎 Similar Papers

LLMmap: Fingerprinting For Large Language Models