🤖 AI Summary
Large language models (LLMs) lack traceable provenance and reliable pre-training identity markers, hindering model attribution and accountability. Method: We propose SeedPrints, the first approach to leverage the systematic biases introduced by random-seed parameter initialization as intrinsic, seed-level fingerprints. SeedPrints extracts unique identifiers without requiring any training, ensuring stability across training stages, robustness to distributional shifts, and full-lifecycle traceability. It employs token-selection bias modeling, statistical significance testing, and cross-stage fingerprint alignment. Contribution/Results: Evaluated on LLaMA- and Qwen-style architectures, SeedPrints achieves >99% seed-identification accuracy on large-scale pre-trained models and demonstrates long-term consistency. Benchmarking in realistic deployment scenarios validates it as the first lightweight solution for model provenance tracking and ownership verification that is effective even before pre-training begins.
📝 Abstract
Fingerprinting Large Language Models (LLMs) is essential for provenance verification and model attribution. Existing methods typically extract post-hoc signatures based on training dynamics, data exposure, or hyperparameters -- properties that only emerge after training begins. In contrast, we propose a stronger and more intrinsic notion of LLM fingerprinting: SeedPrints, a method that leverages random initialization biases as persistent, seed-dependent identifiers present even before training. We show that untrained models exhibit reproducible token-selection biases conditioned solely on their parameters at initialization. These biases are stable and measurable throughout training, enabling our statistical detection method to recover a model's lineage with high confidence. Unlike prior techniques, which are unreliable before convergence and vulnerable to distribution shifts, SeedPrints remains effective across all training stages and robust under domain shifts or parameter modifications. Experiments on LLaMA-style and Qwen-style models show that SeedPrints achieves seed-level distinguishability and can provide birth-to-lifecycle identity verification akin to a biometric fingerprint. Evaluations on large-scale pretrained models and fingerprinting benchmarks further confirm its effectiveness under practical deployment scenarios. These results suggest that initialization itself imprints a unique and persistent identity on neural language models, forming a true "Galtonian" fingerprint.
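The core intuition, that an untrained model's random initialization already determines a reproducible token-selection pattern, can be illustrated with a minimal toy sketch. This is not the paper's actual method: the single linear "head", the probe inputs, and the `match_score` overlap statistic are all hypothetical stand-ins used only to show how same-seed initializations agree perfectly on argmax token choices while different-seed initializations agree only at chance level.

```python
import numpy as np

def init_logits_fn(seed, vocab=100, dim=32):
    # Hypothetical toy "model": a single randomly initialized linear
    # head mapping hidden states to vocabulary logits. The seed fully
    # determines the weights, and hence the model's token preferences.
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0, (dim, vocab))
    return lambda h: h @ W

def token_bias_fingerprint(logits_fn, probes):
    # The "fingerprint" is the sequence of argmax tokens the untrained
    # head selects on a fixed, shared set of probe inputs.
    return np.array([np.argmax(logits_fn(h)) for h in probes])

def match_score(fp_a, fp_b):
    # Fraction of probes on which the two models pick the same token.
    # For independent inits this is ~1/vocab; for identical inits it is 1.
    return float(np.mean(fp_a == fp_b))

# Fixed probe inputs shared across all comparisons.
probes = np.random.default_rng(0).normal(0.0, 1.0, (500, 32))

fp_seed1 = token_bias_fingerprint(init_logits_fn(seed=1), probes)
fp_seed1_again = token_bias_fingerprint(init_logits_fn(seed=1), probes)
fp_seed2 = token_bias_fingerprint(init_logits_fn(seed=2), probes)

print(match_score(fp_seed1, fp_seed1_again))  # 1.0: same seed, identical bias
print(match_score(fp_seed1, fp_seed2))        # near chance (~0.01) for a different seed
```

In the paper's setting the analogous comparison is run on full LLM architectures, and a statistical significance test replaces this raw overlap score; the sketch only demonstrates why seed-determined selection bias is a usable identifier before any training has occurred.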