From Essence to Defense: Adaptive Semantic-aware Watermarking for Embedding-as-a-Service Copyright Protection

📅 2025-12-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing watermarking methods for Embeddings-as-a-Service (EaaS) overlook the semantic properties of embeddings, which limits their harmlessness and stealthiness and can distort the embedding distribution. To address this, we propose SemMark, a semantic-aware watermarking framework. Methodologically: (i) we design an adaptive watermark weighting mechanism based on the Local Outlier Factor (LOF); (ii) we use LSH-driven partitioning of the semantic space so that watermarks are injected only into specific regions; and (iii) we evaluate the method under the proposed Detect-Sampling and Dimensionality-Reduction attacks. Experiments on four popular NLP datasets report superior verifiability, stealthiness, harmlessness, and diversity, with the original embedding distribution preserved.

📝 Abstract
Benefiting from the superior capabilities of large language models in natural language understanding and generation, Embeddings-as-a-Service (EaaS) has emerged as a successful commercial paradigm on the web. However, prior studies have revealed that EaaS is vulnerable to imitation attacks. Existing methods protect the intellectual property of EaaS through watermarking techniques, but they all ignore the most important property of embeddings, semantics, which results in limited harmlessness and stealthiness. To this end, we propose SemMark, a novel semantic-based watermarking paradigm for EaaS copyright protection. SemMark employs locality-sensitive hashing to partition the semantic space and inject semantic-aware watermarks into specific regions, ensuring that the watermark signals remain imperceptible and diverse. In addition, we introduce an adaptive watermark weight mechanism based on the local outlier factor to preserve the original embedding distribution. Furthermore, we propose the Detect-Sampling and Dimensionality-Reduction attacks and construct four scenarios to evaluate the watermarking method. Extensive experiments on four popular NLP datasets show that SemMark achieves superior verifiability, diversity, stealthiness, and harmlessness.
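As a rough illustration of what partitioning an embedding space with locality-sensitive hashing can look like, the sketch below uses random-hyperplane (sign) hashing, a common LSH family for cosine similarity. This is an assumption made for illustration: the abstract does not say which LSH family SemMark uses, how many hash bits it takes, or how watermark regions are selected. The names `make_hyperplanes`, `lsh_bucket`, `in_watermark_region`, and `marked_buckets` are hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplane normals in R^dim (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_bucket(embedding, hyperplanes):
    """Map an embedding to a bucket id: one sign bit per hyperplane."""
    bucket = 0
    for normal in hyperplanes:
        dot = sum(x * w for x, w in zip(embedding, normal))
        bucket = (bucket << 1) | (1 if dot >= 0 else 0)
    return bucket


def in_watermark_region(embedding, hyperplanes, marked_buckets):
    """Assumed rule: an embedding is eligible for watermarking if its bucket is marked."""
    return lsh_bucket(embedding, hyperplanes) in marked_buckets


if __name__ == "__main__":
    planes = make_hyperplanes(dim=8, n_bits=4, seed=1)
    emb = [0.3, -1.2, 0.5, 0.0, 2.0, -0.7, 0.1, 0.9]
    bucket = lsh_bucket(emb, planes)
    # Mark the bucket this embedding falls into, purely for demonstration.
    print(bucket, in_watermark_region(emb, planes, {bucket}))
```

Each bit depends only on the sign of a dot product, so embeddings separated by a small angle tend to share a bucket, and a rule keyed on bucket ids therefore applies to a region of the space rather than to a single point.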
Problem

Research questions and friction points this paper is trying to address.

Protects EaaS from imitation attacks via semantic-aware watermarking
Ensures watermarks are imperceptible and preserve embedding distribution
Evaluates method against novel attacks across diverse NLP datasets
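One of the attacks named in the abstract is Dimensionality-Reduction. The abstract does not describe how it is carried out, so the sketch below only illustrates the general idea of an attacker projecting stolen embeddings into a lower-dimensional space before using them. Gaussian random projection is a stand-in chosen for simplicity and is not claimed to be the paper's method; `random_projection_matrix`, `reduce_embedding`, and `cosine` are hypothetical names.

```python
import math
import random


def random_projection_matrix(in_dim, out_dim, seed=0):
    """Gaussian projection matrix scaled by 1/sqrt(out_dim) (assumed stand-in)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def reduce_embedding(embedding, matrix):
    """Project one embedding into the lower-dimensional space."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in matrix]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


if __name__ == "__main__":
    matrix = random_projection_matrix(in_dim=6, out_dim=3, seed=7)
    a = [0.5, -1.0, 0.25, 2.0, 0.0, 1.5]
    b = [0.4, -0.9, 0.30, 1.8, 0.1, 1.4]
    # Compare similarity before and after the projection.
    print(cosine(a, b), cosine(reduce_embedding(a, matrix), reduce_embedding(b, matrix)))
```

A projection of this kind mixes all input coordinates, so any verification that relies on the original coordinates has to be re-examined in the reduced space. That is presumably why such an attack is a useful stress test for a watermark.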
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-aware watermarking using locality-sensitive hashing
Adaptive watermark weight mechanism based on local outlier factor
Evaluation via Detect-Sampling and Dimensionality-Reduction attack scenarios
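To make the local-outlier-factor idea listed above more concrete, here is a minimal sketch under stated assumptions. The LOF computation follows the standard definition, simplified to take exactly k neighbors and ignore distance ties. How SemMark turns an LOF value into a watermark weight is not given in this summary, so `adaptive_weight` is a hypothetical rule: it blends the embedding toward an assumed watermark target vector and keeps the largest candidate weight whose blended point still has an LOF at or below a threshold `max_lof`.

```python
import math


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(point, ref, k, skip=None):
    """k nearest (distance, index) pairs in ref, optionally skipping one index."""
    pairs = [(dist(point, r), j) for j, r in enumerate(ref) if j != skip]
    return sorted(pairs)[:k]


def lrd(point, ref, k, skip=None):
    """Local reachability density of point with respect to ref."""
    reach = []
    for d, j in k_nearest(point, ref, k, skip):
        k_dist_j = k_nearest(ref[j], ref, k, skip=j)[-1][0]
        reach.append(max(d, k_dist_j))  # reachability distance
    return len(reach) / max(sum(reach), 1e-12)


def lof(query, ref, k):
    """Local outlier factor of query relative to the reference set ref."""
    nbrs = k_nearest(query, ref, k)
    own = lrd(query, ref, k)
    nbr_lrd = [lrd(ref[j], ref, k, skip=j) for _, j in nbrs]
    return sum(nbr_lrd) / (len(nbr_lrd) * own)


def adaptive_weight(embedding, watermark, ref, k=3, max_lof=2.0,
                    candidates=(0.5, 0.4, 0.3, 0.2, 0.1, 0.0)):
    """Hypothetical rule: largest weight whose blended point is not an outlier."""
    for w in candidates:  # candidates are tried from largest to smallest
        mixed = [(1 - w) * e + w * t for e, t in zip(embedding, watermark)]
        if lof(mixed, ref, k) <= max_lof:
            return w
    return 0.0


if __name__ == "__main__":
    # Toy reference set: a 3x3 grid standing in for clean embeddings.
    grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
    print(adaptive_weight((1.0, 1.0), (10.0, 10.0), grid))
    print(adaptive_weight((1.0, 1.0), (2.0, 1.0), grid))
```

In this toy setup, a watermark target far from the reference points forces a smaller weight than a target inside the cluster, which is the kind of behavior an outlier-aware weighting is meant to produce. In practice a library implementation such as scikit-learn's `LocalOutlierFactor` would likely replace the hand-written LOF.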
Hao Li
Institute of Information Engineering, Chinese Academy of Sciences, School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Yubing Ren
Institute of Information Engineering, Chinese Academy of Sciences, School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Yanan Cao
Institute of Information Engineering, Chinese Academy of Sciences
Yingjie Li
Institute of Information Engineering, Chinese Academy of Sciences, School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Fang Fang
Institute of Information Engineering, Chinese Academy of Sciences, School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Xuebin Wang
Institute of Information Engineering, Chinese Academy of Sciences, School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China