🤖 AI Summary
Existing multidimensional Laplace mechanisms for text d_X-privacy suffer from a semantic mismatch: under high-dimensional word embeddings, noise-perturbed outputs concentrate either at the original word or at semantically distant words, rarely yielding plausible synonyms, which makes it hard to preserve privacy and semantic utility at the same time. This work identifies the geometric root of the phenomenon: the nearest-neighbor distance gap in embedding spaces, together with the distribution of the dot product between the Laplace noise vector and the word embeddings, determines which word is released. Using high-dimensional probability theory and tail-bound analysis, we rigorously derive the distribution, moments, and tail bounds of this dot product. Building on this analysis, we propose a post-processing correction. Experiments demonstrate that our approach significantly improves the semantic plausibility and downstream task performance of perturbed words while preserving the formal d_X-privacy guarantee.
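The dot-product behavior the summary refers to can be checked empirically. The sketch below assumes the commonly used sampler for noise with density proportional to exp(−ε‖z‖) in R^d (a uniform direction scaled by a Gamma(d, 1/ε) magnitude); the dimensions, ε, and variable names are illustrative, not the paper's exact setup. For this particular sampler the dot product with a fixed unit vector has mean 0 and variance (d + 1)/ε², i.e. it fluctuates on a scale comparable to typical embedding norms.

```python
import numpy as np

rng = np.random.default_rng(0)

def multivariate_laplace(d, eps, rng):
    """Sample z in R^d with density proportional to exp(-eps * ||z||_2):
    a uniform direction scaled by a Gamma(shape=d, scale=1/eps) magnitude."""
    direction = rng.standard_normal(d)
    direction /= np.linalg.norm(direction)
    return rng.gamma(shape=d, scale=1.0 / eps) * direction

d, eps, n = 300, 10.0, 20_000
u = np.zeros(d)
u[0] = 1.0  # a fixed unit vector standing in for a word embedding
dots = np.array([multivariate_laplace(d, eps, rng) @ u for _ in range(n)])

# For this sampler: E[z.u] = 0 and Var[z.u] = (d + 1) / eps**2 (= 3.01 here),
# so the noise's projection onto any embedding direction is far from negligible.
```

The variance identity follows because the uniform direction contributes E[u₁²] = 1/d and the Gamma magnitude contributes E[R²] = d(d + 1)/ε².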
📝 Abstract
A widely used method to ensure privacy of unstructured text data is the multidimensional Laplace mechanism for $d_X$-privacy, a relaxation of differential privacy for metric spaces. We identify an intriguing peculiarity of this mechanism. When applied on a word-by-word basis, it either outputs the original word or completely dissimilar words, and very rarely any semantically similar word. We investigate this observation in detail and tie it to the fact that the distance to the nearest neighbor of a word in any (high-dimensional) word embedding model is much larger than the relative difference in distances to any two of its consecutive neighbors. We also show that the dot product of the multidimensional Laplace noise vector with any word embedding plays a crucial role in designating the nearest neighbor. We derive the distribution, moments, and tail bounds of this dot product. Finally, we propose a fix as a post-processing step, which satisfactorily removes the above-mentioned issue.
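A toy version of the word-by-word mechanism makes the peculiarity concrete. This is a sketch under stated assumptions: a random synthetic vocabulary stands in for a real embedding model, and the noise sampler (density proportional to exp(−ε‖z‖)) and the value of `eps` are illustrative, not the paper's calibration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: 50 random unit vectors in R^300 stand in for word embeddings.
d, vocab_size, eps = 300, 50, 60.0
vocab = rng.standard_normal((vocab_size, d))
vocab /= np.linalg.norm(vocab, axis=1, keepdims=True)

def laplace_noise(d, eps, rng):
    """Noise with density proportional to exp(-eps * ||z||_2)."""
    z = rng.standard_normal(d)
    z /= np.linalg.norm(z)
    return rng.gamma(shape=d, scale=1.0 / eps) * z

def perturb_word(i, eps, rng):
    """Word-by-word mechanism: perturb the embedding, release the nearest word."""
    noisy = vocab[i] + laplace_noise(d, eps, rng)
    return int(np.argmin(np.linalg.norm(vocab - noisy, axis=1)))

# Rank (by distance from word 0) of each released word. The released word is
# either the original (rank 0) or an essentially arbitrary other word: which
# competitor wins is driven by the dot product of the noise with the
# embeddings, not by semantic proximity to word 0.
order = np.argsort(np.linalg.norm(vocab - vocab[0], axis=1))
rank_of = {int(w): r for r, w in enumerate(order)}
ranks = [rank_of[perturb_word(0, eps, rng)] for _ in range(500)]
```

Varying `eps` shifts the mix between the two regimes (always the original word for large `eps`, mostly distant words for small `eps`), but intermediate near neighbors remain rare throughout, which is the observation the paper formalizes.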