Generate-then-Verify: Reconstructing Data from Limited Published Statistics

📅 2025-04-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses partial reconstruction of tabular data under sparse summary statistics (e.g., marginal totals), aiming to identify row/column subsets that hold with certainty—i.e., in all datasets consistent with the released statistics—rather than fully recovering the original data. We introduce the "generate-then-verify" paradigm, framing partial reconstruction as a problem of generating verifiable assertions, thereby departing from conventional full-reconstruction assumptions. Our method employs a two-stage integer linear programming framework: first generating candidate assertions, then rigorously verifying their validity via feasibility analysis to guarantee deterministic correctness. Experiments on housing-level microdata from the U.S. Decennial Census demonstrate that hundreds of provably correct records can be extracted with certainty using only a small number of released marginals. These results expose an underappreciated risk of deterministic privacy leakage in sparse statistical disclosure scenarios.

📝 Abstract
We study the problem of reconstructing tabular data from aggregate statistics, in which the attacker aims to identify interesting claims about the sensitive data that can be verified with 100% certainty given the aggregates. Successful attempts in prior work have conducted studies in settings where the set of published statistics is rich enough that entire datasets can be reconstructed with certainty. In our work, we instead focus on the regime where many possible datasets match the published statistics, making it impossible to reconstruct the entire private dataset perfectly (i.e., when approaches in prior work fail). We propose the problem of partial data reconstruction, in which the goal of the adversary is to instead output a $\textit{subset}$ of rows and/or columns that are $\textit{guaranteed to be correct}$. We introduce a novel integer programming approach that first $\textbf{generates}$ a set of claims and then $\textbf{verifies}$ whether each claim holds for all possible datasets consistent with the published aggregates. We evaluate our approach on the housing-level microdata from the U.S. Decennial Census release, demonstrating that privacy violations can still persist even when information published about such data is relatively sparse.
Problem

Research questions and friction points this paper is trying to address.

Reconstructing partial tabular data from limited aggregate statistics
Identifying verifiable claims about sensitive data using published aggregates
Addressing privacy violations in sparse data releases via generate-then-verify
Innovation

Methods, ideas, or system contributions that make the work stand out.

Partial data reconstruction from sparse statistics
Generate-then-verify with integer programming
Guaranteed correct subset extraction
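The generate-then-verify idea above can be illustrated with a toy sketch. The paper uses integer programming over the space of datasets consistent with the published marginals; here, brute-force enumeration stands in for the ILP solver (all function names are hypothetical illustrations, not the paper's implementation). A claim about a cell is "guaranteed correct" only if it holds in every table consistent with the released row and column sums:

```python
from itertools import product

def consistent_tables(row_sums, col_sums, max_val=1):
    """Enumerate every small table with entries in {0..max_val}
    whose row and column sums match the published marginals."""
    n, m = len(row_sums), len(col_sums)
    for flat in product(range(max_val + 1), repeat=n * m):
        t = [list(flat[i * m:(i + 1) * m]) for i in range(n)]
        if (all(sum(t[i]) == row_sums[i] for i in range(n)) and
                all(sum(t[i][j] for i in range(n)) == col_sums[j]
                    for j in range(m))):
            yield t

def verify_claim(row_sums, col_sums, i, j, value):
    """Verify step: the claim 'cell (i, j) == value' is certain
    iff it holds in ALL datasets consistent with the aggregates."""
    tables = list(consistent_tables(row_sums, col_sums))
    return bool(tables) and all(t[i][j] == value for t in tables)

# With row sums [2, 1] and column sums [2, 1] on a 2x2 binary table,
# row 0 is forced to be [1, 1], so cell (0, 0) = 1 holds with certainty.
print(verify_claim([2, 1], [2, 1], 0, 0, 1))  # True

# With row sums [1, 1] and column sums [1, 1], two tables are consistent
# ([[1,0],[0,1]] and [[0,1],[1,0]]), so no cell claim is certain.
print(verify_claim([1, 1], [1, 1], 0, 0, 1))  # False
```

Enumeration is exponential and only feasible on toy inputs; the paper's ILP formulation instead checks whether the aggregate constraints plus the negation of a candidate claim admit any feasible dataset, declaring the claim certain when they do not.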