Fingerprinting LLMs via Prompt Injection

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM provenance detection methods either embed watermarks into the base model before release, which is infeasible for already-published models, or compare outputs under handcrafted or random prompts, which are not robust to post-processing such as fine-tuning or quantization. This paper introduces LLMPrint, a fingerprinting framework that exploits LLMs' inherent vulnerability to prompt injection. An optimization algorithm generates model-specific fingerprint prompts that reliably elicit distinctive token-level preferences and survive post-processing. A unified verification procedure operates under both gray-box and black-box settings, and provenance attribution is achieved via statistically rigorous hypothesis testing. Evaluation across five base models and around 700 post-trained or quantized variants shows that LLMPrint achieves high true-positive rates while driving false-positive rates nearly to zero, outperforming state-of-the-art alternatives.
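To make the verification step concrete, here is a minimal gray-box sketch: it checks whether a candidate model still ranks a fingerprint's preferred token first. The `logits_fn` callable and the `(prompt, preferred_token_id)` fingerprint pairs are hypothetical stand-ins for illustration; the paper's actual optimization and scoring procedure is not reproduced here.

```python
import numpy as np

def fingerprint_match(logits_fn, prompt: str, preferred_token_id: int) -> bool:
    """Gray-box check: does the model still rank the fingerprint's
    preferred token first among next-token candidates?"""
    logits = logits_fn(prompt)  # assumed: returns a (vocab_size,) array of next-token logits
    return int(np.argmax(logits)) == preferred_token_id

def match_rate(logits_fn, fingerprints) -> float:
    """Fraction of (prompt, preferred_token_id) fingerprints that still match."""
    hits = sum(fingerprint_match(logits_fn, p, t) for p, t in fingerprints)
    return hits / len(fingerprints)
```

A derived model (e.g., a fine-tuned or quantized variant) should preserve most of these preferences, while an unrelated model should match only at roughly chance rate.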

📝 Abstract
Large language models (LLMs) are often modified after release through post-processing such as post-training or quantization, which makes it challenging to determine whether one model is derived from another. Existing provenance detection methods have two main limitations: (1) they embed signals into the base model before release, which is infeasible for already published models, or (2) they compare outputs across models using hand-crafted or random prompts, which are not robust to post-processing. In this work, we propose LLMPrint, a novel detection framework that constructs fingerprints by exploiting LLMs' inherent vulnerability to prompt injection. Our key insight is that by optimizing fingerprint prompts to enforce consistent token preferences, we can obtain fingerprints that are both unique to the base model and robust to post-processing. We further develop a unified verification procedure that applies to both gray-box and black-box settings, with statistical guarantees. We evaluate LLMPrint on five base models and around 700 post-trained or quantized variants. Our results show that LLMPrint achieves high true positive rates while keeping false positive rates near zero.
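The abstract's "unified verification procedure ... with statistical guarantees" suggests a hypothesis test over per-prompt match outcomes. Below is a hedged sketch using a one-sided binomial test; the null match rate `p0` and significance level `alpha` are illustrative assumptions, not the paper's calibrated values.

```python
from scipy.stats import binomtest

def is_derived(n_matches: int, n_prompts: int,
               p0: float = 0.05, alpha: float = 0.01) -> bool:
    """Reject H0 ('candidate is unrelated to the base model') when the
    observed number of fingerprint matches is significantly above the
    chance rate p0 expected of an unrelated model."""
    result = binomtest(n_matches, n_prompts, p0, alternative="greater")
    return result.pvalue < alpha
```

For example, 40 matches out of 50 fingerprint prompts against a 5% chance rate yields a vanishingly small p-value, so the candidate would be attributed to the base model.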
Problem

Research questions and friction points this paper is trying to address.

Detecting model provenance for post-processed LLMs without pre-embedded signals
Constructing fingerprints that exploit prompt injection and remain robust to post-processing
Statistically verifying model provenance in both gray-box and black-box settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses prompt injection to create model fingerprints
Optimizes prompts for consistent token preferences (a sketch follows this list)
Works in both gray-box and black-box settings
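To illustrate the prompt-optimization idea, here is a greedy token-swap search that pushes a chosen target token to the top of the next-token distribution. The `logits_fn` interface, the margin objective, and the random-swap search strategy are assumptions made for this sketch; the paper's actual optimizer may differ substantially.

```python
import random

def optimize_fingerprint(logits_fn, base_prompt: str, target_id: int,
                         vocab: list[str], length: int = 8,
                         iters: int = 200, seed: int = 0) -> str:
    """Illustrative search for a suffix that makes `target_id` the
    model's preferred next token after `base_prompt`."""
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(length)]

    def margin(tokens: list[str]) -> float:
        # Margin between the target token's logit and its best competitor.
        logits = logits_fn(base_prompt + " " + " ".join(tokens))
        best_other = max(l for i, l in enumerate(logits) if i != target_id)
        return logits[target_id] - best_other

    score = margin(suffix)
    for _ in range(iters):
        # Propose a single-token swap and keep it only if the margin improves.
        pos, tok = rng.randrange(length), rng.choice(vocab)
        candidate = suffix[:pos] + [tok] + suffix[pos + 1:]
        cand_score = margin(candidate)
        if cand_score > score:
            suffix, score = candidate, cand_score
    return base_prompt + " " + " ".join(suffix)
```

The larger the final margin, the more likely the token preference survives perturbations such as quantization, which is what makes the fingerprint robust rather than merely distinctive.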