Adaptive and Robust Watermark for Generative Tabular Data

📅 2024-09-23
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Generative tabular data faces two critical challenges: verifying authenticity and guarding against malicious misuse. Method: The authors propose a downstream-task-aware adaptive watermarking mechanism. The approach partitions features into (key, value) column pairs; for each element of a key column, it pseudorandomly generates a set of "green" value intervals and constrains the paired value element to fall within one of them. Watermark embedding and efficient detection are achieved through feature-space partitioning, constraint-based generation, and statistical hypothesis testing. Contribution/Results: This work presents the first customizable, provably secure watermarking scheme for tabular data, robust against multiple adversarial attacks, including noise injection, column deletion, and row resampling. Experiments demonstrate near-lossless preservation of statistical fidelity and downstream-task performance after embedding, alongside high detection accuracy and strong robustness.

📝 Abstract
Recent developments in generative models have demonstrated their ability to create high-quality synthetic data. However, the pervasiveness of synthetic content online also brings forth growing concerns that it can be used for malicious purposes. To ensure the authenticity of the data, watermarking techniques have recently emerged as a promising solution due to their strong statistical guarantees. In this paper, we propose a flexible and robust watermarking mechanism for generative tabular data. Specifically, a data provider with knowledge of the downstream tasks can partition the feature space into pairs of $(key, value)$ columns. Within each pair, the data provider first uses elements in the $key$ column to generate a randomized set of ''green'' intervals, then encourages elements of the $value$ column to be in one of these ''green'' intervals. We show theoretically and empirically that the watermarked datasets (i) have negligible impact on the data quality and downstream utility, (ii) can be efficiently detected, and (iii) are robust against multiple attacks commonly observed in data science.
Problem

Research questions and friction points this paper is trying to address.

Ensures authenticity of synthetic tabular data via watermarking
Minimizes impact on data quality and downstream utility
Provides robustness against attacks and security against adversaries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Partitions feature space into (key, value) columns
Generates randomized green intervals for watermarking
Ensures robustness against attacks and maintains utility
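The (key, value) green-interval mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names (`green_intervals`, `embed`, `detect`), the equal-width binning of a normalized value range, and the SHA-256 keying of the per-key PRNG are all assumptions made for the sketch. Detection uses a one-sided z-test, since under the no-watermark null each row lands in a green interval with probability equal to the green fraction.

```python
import hashlib
import numpy as np

def green_intervals(key, n_bins=10, green_frac=0.5, seed=0):
    # Hypothetical helper: derive a PRNG from the key element so the
    # detector can re-create the same "green" bins without shared state.
    h = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    rng = np.random.default_rng(int.from_bytes(h[:8], "big"))
    # Mark a random green_frac share of equal-width bins on [0, 1) green.
    green = rng.choice(n_bins, size=int(n_bins * green_frac), replace=False)
    return set(green.tolist())

def embed(keys, values, n_bins=10):
    # Snap each value (assumed normalized to [0, 1)) into the nearest
    # green bin derived from its paired key, keeping the within-bin offset.
    out = []
    for k, v in zip(keys, values):
        green = green_intervals(k, n_bins)
        b = min(int(v * n_bins), n_bins - 1)
        if b not in green:
            b = min(green, key=lambda g: abs(g - b))
        out.append((b + (v * n_bins) % 1.0) / n_bins)
    return out

def detect(keys, values, n_bins=10, green_frac=0.5):
    # Count rows whose value falls in a green bin; under the null each
    # hit has probability green_frac, giving a one-sided z statistic.
    hits = sum(
        min(int(v * n_bins), n_bins - 1) in green_intervals(k, n_bins)
        for k, v in zip(keys, values)
    )
    n = len(values)
    return (hits - n * green_frac) / np.sqrt(n * green_frac * (1 - green_frac))
```

In this toy version every watermarked value is forced into a green bin, so the z-statistic grows on the order of the square root of the row count, while unwatermarked data stays near zero; the paper's actual scheme instead biases generation toward green intervals to preserve the data distribution.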
D. D. Ngo, University of Minnesota
Daniel Scott, JP Morgan AI Research
Saheed O. Obitayo, JP Morgan AI Research
Vamsi K. Potluru, JP Morgan AI Research
Manuela Veloso, JP Morgan AI Research