PRIVET: Privacy Metric Based on Extreme Value Theory

📅 2025-10-28
📈 Citations: 0
✹ Influential: 0
đŸ€– AI Summary
Existing privacy evaluation methods rely predominantly on global metrics, so they cannot quantify privacy leakage risk for individual samples and offer little interpretability. To address this, we propose PRIVET, a framework that applies extreme value theory (EVT) to sample-level privacy assessment. PRIVET models the extreme-value distribution of nearest-neighbor distances in embedding space and assigns a fine-grained privacy leakage score to each generated sample. It enables unified, cross-modal evaluation (e.g., images and text) and supports both dataset-level and sample-level analysis. Experiments show that PRIVET identifies memorized and privacy-leaking instances under challenging regimes, including high-dimensional spaces, small-sample settings, and model underfitting. The analysis also reveals limitations in how well current visual embedding models capture perceptual similarity.

📝 Abstract
Deep generative models are often trained on sensitive data, such as genetic sequences, health data, or more broadly, any copyrighted, licensed or protected content. This raises critical concerns around privacy-preserving synthetic data, and more specifically around privacy leakage, an issue closely tied to overfitting. Existing methods almost exclusively rely on global criteria to estimate the risk of privacy failure associated with a model, offering only quantitative, non-interpretable insights. The absence of rigorous evaluation methods for data privacy at the sample level may hinder the practical deployment of synthetic data in real-world applications. Using extreme value statistics on nearest-neighbor distances, we propose PRIVET, a generic sample-based, modality-agnostic algorithm that assigns an individual privacy leak score to each synthetic sample. We empirically demonstrate that PRIVET reliably detects instances of memorization and privacy leakage across diverse data modalities, including settings with very high dimensionality, limited sample sizes (such as genetic data), and even underfitting regimes. We compare our method to existing approaches under controlled settings and show its advantage in providing both dataset-level and sample-level assessments through qualitative and quantitative outputs. Additionally, our analysis reveals limitations of existing computer vision embeddings in yielding perceptually meaningful distances when identifying near-duplicate samples.
Problem

Research questions and friction points this paper is trying to address.

Measuring sample-level privacy leakage in deep generative models
Detecting memorization and privacy risks across diverse data types
Overcoming limitations of global privacy assessment methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses extreme value statistics on nearest-neighbor distances
Assigns an individual privacy leak score to each synthetic sample
Works across diverse data modalities and dimensionalities
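
The idea of scoring samples from the extreme lower tail of nearest-neighbor distances can be illustrated with a minimal sketch. This is not the PRIVET algorithm itself, whose details are not given on this page. It assumes a held-out set drawn from the same distribution as the training data, Euclidean distances, and a power-law lower tail fitted with a Hill-type estimator. The helper names (`nn_distance`, `fit_lower_tail`, `leak_score`) and the choice `k=20` are hypothetical.

```python
import math
import random


def nn_distance(x, points):
    """Euclidean distance from x to its nearest neighbor in points."""
    return min(math.dist(x, p) for p in points)


def fit_lower_tail(ref_dists, k):
    """Fit P(D <= r | D <= u) = (r / u) ** alpha on the k smallest distances.

    u is the (k+1)-th smallest reference distance; alpha is the Hill-type
    maximum-likelihood estimate of the tail exponent (an assumed tail model).
    """
    d = sorted(ref_dists)
    u = d[k]
    alpha = k / sum(math.log(u / max(di, 1e-12)) for di in d[:k])
    return u, alpha


def leak_score(r, u, alpha):
    """-log10 of the fitted conditional tail probability; 0 above the threshold."""
    if r >= u:
        return 0.0
    return alpha * math.log10(u / max(r, 1e-12))


random.seed(0)
dim = 5


def sample():
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


train = [sample() for _ in range(200)]
holdout = [sample() for _ in range(200)]

# Reference (null) distances: how close genuinely new data gets to the training set.
ref_dists = [nn_distance(h, train) for h in holdout]
u, alpha = fit_lower_tail(ref_dists, k=20)

# "Synthetic" samples: five fresh draws and one near-copy of a training point.
fresh = [sample() for _ in range(5)]
near_copy = [v + random.gauss(0.0, 1e-3) for v in train[0]]

fresh_scores = [leak_score(nn_distance(s, train), u, alpha) for s in fresh]
copy_score = leak_score(nn_distance(near_copy, train), u, alpha)

print("fresh scores:", [round(s, 2) for s in fresh_scores])
print("near-copy score:", round(copy_score, 2))
```

In this toy setup, a sample that sits far closer to the training set than held-out data ever does receives a large score, while ordinary samples score at or near zero. Averaging or counting the per-sample scores would give a dataset-level summary.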
Antoine Szatkownik
Université Paris-Saclay, CNRS, INRIA, LISN, Gif-sur-Yvette, France
Aurélien Decelle
Research, Universidad Politécnica de Madrid
statistical physics, machine learning, Bayesian inference, artificial intelligence
Beatriz Seoane
Departamento de FĂ­sica TeĂłrica, Universidad Complutense de Madrid, Madrid, Spain
Nicolas Béreux
Université Paris-Saclay, CNRS, INRIA, LISN, Gif-sur-Yvette, France
Léo Planche
Université Paris-Saclay, CNRS, INRIA, LISN, Gif-sur-Yvette, France
Guillaume Charpiat
INRIA (Saclay)
Artificial intelligence, statistical learning, computer vision, shape statistics, optimization
Burak Yelmen
Estonian Genome Centre, Institute of Genomics, University of Tartu, Tartu, Estonia
Flora Jay
Université Paris-Saclay, CNRS, INRIA, LISN, Gif-sur-Yvette, France
Cyril Furtlehner
Inria
statistical physics, machine learning, complex systems, traffic forecasting