🤖 AI Summary
Existing LLM provenance detection methods either embed signals into the base model before release, which is infeasible for already-published models, or compare model outputs using hand-crafted or random prompts, which are not robust to post-processing such as fine-tuning or quantization. This paper introduces LLMPrint, a detection framework that exploits the inherent vulnerability of LLMs to prompt injection. It optimizes fingerprint prompts to enforce consistent token preferences, yielding fingerprints that are both unique to the base model and robust to post-processing. A unified verification procedure covers both gray-box and black-box settings, and provenance is attributed through hypothesis testing with statistical guarantees. Evaluation across five base models and around 700 post-trained or quantized variants shows that LLMPrint achieves high true-positive rates while keeping false-positive rates near zero, outperforming state-of-the-art alternatives.
📝 Abstract
Large language models (LLMs) are often modified after release through post-processing such as post-training or quantization, which makes it challenging to determine whether one model is derived from another. Existing provenance detection methods have two main limitations: (1) they embed signals into the base model before release, which is infeasible for already published models, or (2) they compare outputs across models using hand-crafted or random prompts, which are not robust to post-processing. In this work, we propose LLMPrint, a novel detection framework that constructs fingerprints by exploiting LLMs' inherent vulnerability to prompt injection. Our key insight is that by optimizing fingerprint prompts to enforce consistent token preferences, we can obtain fingerprints that are both unique to the base model and robust to post-processing. We further develop a unified verification procedure that applies to both gray-box and black-box settings, with statistical guarantees. We evaluate LLMPrint on five base models and around 700 post-trained or quantized variants. Our results show that LLMPrint achieves high true positive rates while keeping false positive rates near zero.
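To make the verification step more concrete, the following is a minimal sketch of how a hypothesis test over fingerprint prompts might look. It is not the authors' procedure: the one-sided binomial test, the function names (`count_matches`, `binomial_tail`, `provenance_test`), the chance-match rate `p0`, and the significance level `alpha` are all assumptions chosen for illustration. The idea is that each fingerprint prompt has a target token the base model prefers; if a suspect model produces the target token far more often than an unrelated model would by chance, the null hypothesis "not derived from the base model" is rejected.

```python
from math import comb
from typing import Sequence, Tuple


def count_matches(observed: Sequence[str], targets: Sequence[str]) -> int:
    """Count fingerprint prompts whose observed token equals the target token."""
    if len(observed) != len(targets):
        raise ValueError("observed and targets must have the same length")
    return sum(o == t for o, t in zip(observed, targets))


def binomial_tail(k: int, n: int, p0: float) -> float:
    """Return P[X >= k] for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def provenance_test(
    matches: int,
    n_prompts: int,
    p0: float = 0.05,     # assumed chance-match rate for an unrelated model
    alpha: float = 0.001,  # assumed significance level (bounds the false-positive rate)
) -> Tuple[bool, float]:
    """One-sided binomial test (illustrative, not the paper's exact test).

    Null hypothesis: the suspect model is unrelated to the base model, so each
    fingerprint prompt matches its target token with probability at most p0.
    Returns (reject_null, p_value); rejecting the null suggests derivation.
    """
    p_value = binomial_tail(matches, n_prompts, p0)
    return p_value < alpha, p_value


if __name__ == "__main__":
    # Hypothetical outcome: 18 of 20 fingerprint prompts elicit the target token.
    derived, p = provenance_test(matches=18, n_prompts=20)
    print(f"derived={derived}, p_value={p:.3e}")
```

Under these assumptions, `alpha` directly caps the false-positive rate for unrelated models, which is one way a "statistical guarantee" of this kind can be obtained. In a black-box setting the observed tokens would come from sampled outputs, while a gray-box setting could compare token probabilities instead; the sketch above covers only the simpler token-match case.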