AI Summary
Existing watermarking methods for Embeddings-as-a-Service (EaaS) overlook the semantic properties of embeddings, which leads to poor harmlessness, low imperceptibility, and distortion of the embedding distribution. To address this, we propose SemMark, a semantic-aware watermarking framework. Methodologically: (i) we design an adaptive watermark weighting mechanism based on the Local Outlier Factor (LOF); (ii) we use locality-sensitive hashing (LSH) to partition the semantic space so that watermarks are injected only into specific regions; and (iii) we propose two attacks, Detect-Sampling and Dimensionality-Reduction, and use them to evaluate the watermark. Experiments on four popular NLP datasets show improvements in verifiability, imperceptibility, harmlessness, and diversity: the watermark signal is hard to detect, while the original embedding's statistical distribution and downstream task performance are preserved.
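To make the LOF-based weighting in (i) concrete, here is a minimal sketch in plain Python. It computes a simplified Local Outlier Factor (exactly k neighbours, no tie handling) and maps it to a per-embedding watermark weight. The function names, the `base_weight` parameter, and the mapping itself (smaller weight for sparser points) are assumptions made for illustration; the summary does not state how SemMark converts LOF scores into weights.

```python
import math

def lof_scores(points, k):
    """Simplified Local Outlier Factor for each point (exactly k neighbours)."""
    n = len(points)
    # k nearest neighbours of each point as (distance, index), excluding itself.
    neighbours = []
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        neighbours.append(dists[:k])
    # Distance from each point to its k-th nearest neighbour.
    k_dist = [nb[-1][0] for nb in neighbours]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbours[i]]
        lrd.append(1.0 / (sum(reach) / len(reach) + 1e-12))
    # LOF: mean density of the neighbours divided by the point's own density.
    return [
        sum(lrd[j] for _, j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]

def adaptive_weight(lof, base_weight=0.2):
    """Hypothetical mapping: outliers (LOF > 1) receive a smaller weight."""
    return base_weight / max(1.0, lof)

# Toy data: a 3x3 grid of inliers plus one distant point.
points = [(float(x), float(y)) for x in range(3) for y in range(3)] + [(10.0, 10.0)]
scores = lof_scores(points, k=3)
weights = [adaptive_weight(s) for s in scores]
```

In this toy example the distant point gets the largest LOF score and therefore the smallest weight, while points inside the grid keep a weight close to `base_weight`.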
Abstract
Benefiting from the strong capabilities of large language models in natural language understanding and generation, Embeddings-as-a-Service (EaaS) has emerged as a successful commercial paradigm on the web. However, prior studies have shown that EaaS is vulnerable to imitation attacks. Existing methods protect the intellectual property of EaaS through watermarking, but they all ignore the most important property of embeddings, their semantics, which limits harmlessness and stealthiness. To this end, we propose SemMark, a novel semantic-based watermarking paradigm for EaaS copyright protection. SemMark employs locality-sensitive hashing to partition the semantic space and injects semantic-aware watermarks into specific regions, ensuring that the watermark signals remain imperceptible and diverse. In addition, we introduce an adaptive watermark weighting mechanism based on the local outlier factor to preserve the original embedding distribution. Furthermore, we propose the Detect-Sampling and Dimensionality-Reduction attacks and construct four scenarios to evaluate the watermarking method. Extensive experiments on four popular NLP datasets show that SemMark achieves superior verifiability, diversity, stealthiness, and harmlessness.
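The region-based injection described in the abstract can be illustrated with a small sketch. It uses random-hyperplane (sign) hashing as the LSH family and a linear blend toward a target vector as the injection rule. Both choices are assumptions for illustration, as are the names `make_hyperplanes`, `lsh_bucket`, `inject`, `target`, and `weight`; the abstract does not specify which hash family or injection rule SemMark uses.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes (assumed LSH family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_bucket(embedding, hyperplanes):
    """Return an integer region id built from one sign bit per hyperplane."""
    bucket = 0
    for plane in hyperplanes:
        dot = sum(e * p for e, p in zip(embedding, plane))
        bucket = (bucket << 1) | (1 if dot >= 0 else 0)
    return bucket

def inject(embedding, hyperplanes, watermark_buckets, target, weight):
    """Blend `target` into the embedding only inside watermarked regions."""
    if lsh_bucket(embedding, hyperplanes) not in watermark_buckets:
        return list(embedding)  # outside the chosen regions: left unchanged
    mixed = [(1.0 - weight) * e + weight * t for e, t in zip(embedding, target)]
    norm = sum(x * x for x in mixed) ** 0.5
    # Re-normalise to unit length, as is common for served embeddings.
    return [x / norm for x in mixed] if norm > 0 else mixed

planes = make_hyperplanes(dim=4, n_bits=3, seed=42)
emb = [0.5, -0.1, 0.3, 0.8]
region = lsh_bucket(emb, planes)
marked = inject(emb, planes, {region}, target=[1.0, 0.0, 0.0, 0.0], weight=0.1)
untouched = inject(emb, planes, set(), target=[1.0, 0.0, 0.0, 0.0], weight=0.1)
```

Because each bit depends only on the sign of a projection, embeddings pointing in similar directions tend to share a region id, which is what lets a watermark be tied to a semantic region rather than to individual trigger tokens.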