🤖 AI Summary
Existing intellectual property (IP) protection for large language models (LLMs) faces two gaps: watermarking methods are easily removed by fine-tuning and knowledge distillation, while fingerprinting techniques cannot provide provable ownership. This paper proposes SEAL, a subspace-anchored watermarking framework and the first to embed multi-bit signatures into orthogonal subspaces of a model's hidden-layer representations. Using anchor-sample-driven subspace alignment and orthogonal vector encoding, SEAL enables verifiable watermark detection in both white-box and black-box settings while achieving high imperceptibility, strong robustness, and minimal degradation of model functionality. Extensive experiments across six mainstream LLMs and multiple benchmarks demonstrate its superiority over 11 state-of-the-art baselines, with high watermark accuracy and reliable detection under diverse adversarial attacks, including knowledge distillation and fine-tuning.
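To make the verification idea concrete, here is a minimal decoding sketch. It assumes a hypothetical 32-bit signature and a 4096-dimensional hidden state: each signature bit is anchored to one column of a random orthonormal basis, and detection projects an anchor sample's hidden representation onto that basis and reads each bit from the sign of the resulting coordinate. The dimensions, the synthetic "watermarked" activation, and the sign convention are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN_DIM = 4096   # hidden size of the protected model (assumed)
NUM_BITS = 32       # length of the owner's signature (assumed)

# Build a random orthonormal basis for the watermark subspace via QR;
# each basis column anchors one signature bit.
basis, _ = np.linalg.qr(rng.standard_normal((HIDDEN_DIM, NUM_BITS)))

signature = rng.integers(0, 2, NUM_BITS)   # the multi-bit signature
signs = 2.0 * signature - 1.0              # map {0, 1} -> {-1, +1}

def decode_bits(hidden_state: np.ndarray) -> np.ndarray:
    """Project an anchor sample's hidden state onto the watermark
    subspace and read each bit from the sign of its coordinate."""
    coords = basis.T @ hidden_state        # (NUM_BITS,) subspace coordinates
    return (coords > 0).astype(int)

def bit_accuracy(hidden_state: np.ndarray) -> float:
    return float((decode_bits(hidden_state) == signature).mean())

# Simulate a watermarked representation by adding the signed basis
# directions to an arbitrary activation (purely illustrative).
clean = rng.standard_normal(HIDDEN_DIM)
watermarked = clean + 5.0 * (basis @ signs)
print(bit_accuracy(watermarked))   # ~1.0 when the watermark is present
```

Because the basis columns are mutually orthogonal, each bit can be read out independently, which is what lets a multi-bit signature coexist in a single hidden state without the bits interfering.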
📝 Abstract
Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, demonstrating human-level performance in text generation, reasoning, and question answering. However, training such models requires substantial computational resources, large curated datasets, and sophisticated alignment procedures. As a result, they constitute highly valuable intellectual property (IP) assets that warrant robust protection mechanisms. Existing IP protection approaches suffer from critical limitations. Model fingerprinting techniques can identify model architectures but fail to establish ownership of specific model instances. In contrast, traditional backdoor-based watermarking methods embed behavioral anomalies that can be easily removed through common post-processing operations such as fine-tuning or knowledge distillation. We propose SEAL, a subspace-anchored watermarking framework that embeds multi-bit signatures directly into the model's latent representational space, supporting both white-box and black-box verification scenarios. Our approach leverages model editing techniques to align the hidden representations of selected anchor samples with predefined orthogonal bit vectors. This alignment embeds the watermark while preserving the model's original factual predictions, rendering the watermark functionally harmless and stealthy. We conduct comprehensive experiments on multiple benchmark datasets and six prominent LLMs, comparing SEAL with 11 existing fingerprinting and watermarking methods to demonstrate its superior effectiveness, fidelity, efficiency, and robustness. Furthermore, we evaluate SEAL under potential knowledgeable attacks and show that it maintains strong verification performance even when adversaries possess knowledge of the watermarking mechanism and the embedded signatures.
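On the embedding side, the abstract describes using model editing to align the hidden representations of anchor samples with predefined orthogonal bit vectors while preserving the model's original factual predictions. One way to picture that trade-off is as a combined objective with an alignment term and a fidelity term; the hinge margin, KL-based fidelity penalty, and weighting below are assumptions for illustration, not the paper's actual editing procedure.

```python
import torch
import torch.nn.functional as F

def watermark_objective(hidden, wm_logits, clean_logits, basis, signs, alpha=1.0):
    """Hypothetical embedding objective: align anchor hidden states with
    the signed watermark directions, while a KL term keeps the edited
    model's predictions on those anchors close to the original model's.
    hidden:       (B, d) anchor hidden states from the edited model
    wm_logits:    (B, V) edited model's logits on the anchor samples
    clean_logits: (B, V) original model's logits on the same samples
    basis:        (d, k) orthonormal bit directions; signs: (k,) in {-1, +1}
    """
    coords = hidden @ basis                       # (B, k) subspace coordinates
    align = F.relu(1.0 - signs * coords).mean()   # hinge: coordinate sign must match bit sign
    fidelity = F.kl_div(                          # preserve original predictions
        F.log_softmax(wm_logits, dim=-1),
        F.softmax(clean_logits, dim=-1),
        reduction="batchmean",
    )
    return align + alpha * fidelity

# Toy shapes only; a real model's hidden size and vocabulary would differ.
B, d, k, V = 4, 4096, 32, 32000
basis, _ = torch.linalg.qr(torch.randn(d, k))
signs = torch.randint(0, 2, (k,)).float() * 2 - 1
hidden = torch.randn(B, d, requires_grad=True)
clean_logits = torch.randn(B, V)
loss = watermark_objective(hidden, clean_logits + 0.01, clean_logits, basis, signs)
loss.backward()
```

In the paper itself this alignment is achieved through model editing rather than by minimizing such a loss end-to-end, so the sketch conveys only the geometry of the constraint: hidden states are pushed toward the signed bit directions while the anchor samples' output distributions stay effectively unchanged.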